Closed YYsong closed 7 years ago
See page 4 of the article for a discussion about the mathematics of heavy-tailed distributions. If that is not sufficient, see these references:
===========================================
Regarding the 40% rule of thumb: the figure on page 3 that you referenced belongs to a general discussion of heavy-tailed distributions and how they compare with more traditionally used distributions (e.g. Gaussians). It does not describe the implementation details of the algorithm.
The 40% rule is a simplification of the actual algorithm that works well in practice. The algorithm stops when the head group is no longer characterized by a heavy-tailed distribution. So, if roughly 40% or more of the data ends up in the head (60% or less in the tail) after a split, the data are most likely not heavy-tailed. You could certainly make this criterion more precise, but it often does not matter in practice.
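To make the stopping rule concrete, here is a minimal sketch in Python. It is not the code from the repository: the function name `head_tail_splits`, the parameter `head_limit`, and the choice to split at the arithmetic mean are all assumptions made for illustration. Only the 40% threshold and the idea of stopping once the head is too large come from the explanation above.

```python
from statistics import mean


def head_tail_splits(values, head_limit=0.4):
    """Return the split points found before the 40% rule stops the loop.

    Hypothetical helper for illustration only. Assumes each split is made
    at the arithmetic mean of the current data.
    """
    data = list(values)
    splits = []
    while len(data) > 1:
        cut = mean(data)
        # The head is the part of the data above the split point.
        head = [v for v in data if v > cut]
        if not head:
            # All remaining values are equal, so there is nothing to split.
            break
        if len(head) / len(data) > head_limit:
            # Too much of the data is in the head: treat the data as
            # no longer heavy-tailed and stop.
            break
        splits.append(cut)
        # Continue with the head only.
        data = head
    return splits


skewed = [1, 1, 1, 1, 1, 1, 1, 2, 10, 100]
uniform = list(range(1, 11))
skewed_splits = head_tail_splits(skewed)
uniform_splits = head_tail_splits(uniform)
```

With the skewed sample, only one value lies above the mean, so the head holds 10% of the data and the split is kept. With the evenly spread sample, half of the values lie above the mean, which exceeds the 40% limit, so no split is recorded.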
I have a question after reading this paper and your code. I can't find any mathematical definition of a heavy-tailed distribution in the paper, but there is a picture on page 3 that shows only 10 percent of the data values in the head. So why did you set the threshold to 0.4?
thx