XiangwenWang / heavytailed

Perform distribution analysis on heavy-tailed distributed data
BSD 2-Clause "Simplified" License

Help for explaining the output #1

Open ttsesm opened 3 years ago

ttsesm commented 3 years ago

Hi, I was trying your code for analyzing heavy-tailed distributions. In my case I have a heavy-tailed dataset, so I ran a sample of it through the code here and got the following output:

xmin=125    Power Law   1.44857249  116024
xmin=125    Exponential 0.00078316  106812
.../scipy/optimize/optimize.py:282: RuntimeWarning: Values in x were outside bounds during a minimize step, clipping to bounds
  warnings.warn("Values in x were outside bounds during a "
xmin=125    Log-Normal  7.01483688 1.01 106695
xmin=125    Pairwise Power Law  1e-06 3.5953322 0.07234651  105206
xmin=125    Power Law With Exp Cutoff   2.0 0.0001  122655
xmin=125    Yule    1.44910929  116018
xmin=125    Poisson 1401.41099181   3986280
.../heavytailed/poisson.py:49: RuntimeWarning: overflow encountered in exp
  normfactor = 1. / np.exp(mu + np.log(1 - sp_pois.cdf(xmin - 1, mu)))
.../heavytailed/poisson.py:52: RuntimeWarning: overflow encountered in exp
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor
.../heavytailed/poisson.py:52: RuntimeWarning: invalid value encountered in double_scalars
  total -= np.exp(x * np.log(mu) - gammaln(x + 1)) * normfactor

and the following graphs:

[Attached figures: survival plots for the log_normal, exponential, pairwise_power_law, poisson, power_law_with_exp_cutoff, power_law, and yule fits]

From what I understand, the pairwise power law seems to fit the input data best, but I am not sure what that means. I would appreciate it if you could provide some feedback.

Thanks.

XiangwenWang commented 3 years ago

Hi there,

I haven't touched the code for roughly 2 years (as I'm no longer in academia), but the last time I ran it there were no such warnings; they were probably introduced by NumPy/SciPy updates. I suspect some of the distributions/models didn't reach their optimal fits, but for now let's assume they are fine.
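That said, if the overflow warnings from poisson.py bother you, one possible workaround (untested, just a sketch) is to keep the truncated-Poisson computation in log space and exponentiate only once:

```python
import numpy as np
from scipy.stats import poisson as sp_pois
from scipy.special import gammaln

def truncated_poisson_pmf(x, mu, xmin):
    """PMF of Poisson(mu) conditioned on x >= xmin, computed in log space."""
    # log of mu**x * exp(-mu) / x!; stays finite even for large mu
    log_pmf = x * np.log(mu) - gammaln(x + 1) - mu
    # log P(X >= xmin); sf() is also more accurate than 1 - cdf() deep in the tail
    log_norm = np.log(sp_pois.sf(xmin - 1, mu))
    return np.exp(log_pmf - log_norm)
```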

The purpose of this package is to examine the tail behavior of a dataset, particularly its tail distribution; in other words, to answer the question "among the candidate models, which one best describes the tail distribution of this dataset?" The general idea is to use statistical techniques to reach a conclusion. Here I relied on the AIC for model selection, as suggested in "Power-law distributions in empirical data" (a well-cited paper).
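As a toy illustration of that recipe (a sketch of the general idea, not the package's internals; it also ignores the tail truncation that the package handles properly):

```python
import numpy as np
from scipy import stats

# fit each candidate by maximum likelihood, then score it with AIC = 2k - 2*ln(L)
data = stats.lognorm.rvs(1.0, scale=200, size=10000, random_state=0)  # toy sample
tail = data[data >= 125]  # keep only the tail, as with xmin=125 above

# candidate 1: exponential (k = 1 free parameter with loc fixed at 0)
exp_params = stats.expon.fit(tail, floc=0)
aic_exp = 2 * 1 - 2 * stats.expon.logpdf(tail, *exp_params).sum()

# candidate 2: lognormal (k = 2 free parameters with loc fixed at 0)
ln_params = stats.lognorm.fit(tail, floc=0)
aic_ln = 2 * 2 - 2 * stats.lognorm.logpdf(tail, *ln_params).sum()

print(aic_exp, aic_ln)  # the candidate with the smaller AIC wins
```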

In the output table, the first column shows where the tail starts; the second column is the model name; the third through second-to-last columns are the fitted parameters of the corresponding model (a model may have several parameters). The last column is the AIC, and during model comparison that is the only column we need to focus on: the model/distribution with the smallest AIC is the best-fitting one. In this demo, the "Pairwise Power Law" model has the smallest AIC, 105206, so we would call it the best model (among the candidates) for describing the tail of the sample dataset.
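Mechanically, that selection step is nothing more than taking the minimum of the last column:

```python
# AIC values copied from the output table above
aic = {
    'Power Law': 116024,
    'Exponential': 106812,
    'Log-Normal': 106695,
    'Pairwise Power Law': 105206,
    'Power Law With Exp Cutoff': 122655,
    'Yule': 116018,
    'Poisson': 3986280,
}
print(min(aic, key=aic.get))  # -> Pairwise Power Law
```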

But there is a crucial problem: what if none of the candidates is correct (or at least a good approximation)? That's why I additionally plot the fitted distributions against the data for verification. Usually we would draw histograms for this purpose, but since we are focusing on fat tails, histograms can be noisy and misleading. Instead, I plot the survival functions, 1 - CDF(x), which are smoother and clearer. From these plots we can confirm that the pairwise power law is the best model among the candidates, and that it provides a good approximation based on visual inspection.
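If it helps, the empirical side of that plot is easy to reproduce; a rough sketch (assuming `tail` holds your sample restricted to x >= xmin):

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.sort(tail)
surv = np.arange(len(xs), 0, -1) / len(xs)  # empirical P(X >= x)
plt.loglog(xs, surv, '.', label='data')
# overlay each fitted model's 1 - CDF(x) here for the visual comparison
plt.xlabel('x')
plt.ylabel('1 - CDF(x)')
plt.legend()
plt.show()
```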

We could replace the visual inspection with statistical tests (such as the K-S test), but since I focused on analyzing empirical "big" data, which is influenced by many factors (resulting in complex behaviors such as small bumps in the distribution), such data usually cannot pass these tests. So visual inspection is still the best option for the verification step.
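For reference, such a check would look roughly like this (reusing `tail` from the sketch above and testing against a fitted lognormal; since the parameters are estimated from the same data, this p-value is only approximate, which is why Clauset et al. use a bootstrapped version):

```python
from scipy import stats

s, loc, scale = stats.lognorm.fit(tail, floc=0)
d_stat, p_value = stats.kstest(tail, 'lognorm', args=(s, loc, scale))
```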

If none of the candidate models passes this rough visual inspection, we probably want to add a customized model to the candidates.
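The underlying recipe for that is the same as for the built-in candidates (this is not a plug-in API in the package, just a sketch): write down the model's log-likelihood, maximize it, and compute the AIC. For a hypothetical continuous stretched-exponential tail, for example:

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = (a/b) * (x/b)**(a-1) * exp((xmin/b)**a - (x/b)**a), normalized on [xmin, inf)
def neg_loglik(params, data, xmin):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    logf = (np.log(a) - np.log(b) + (a - 1) * np.log(data / b)
            + (xmin / b) ** a - (data / b) ** a)
    return -logf.sum()

res = minimize(neg_loglik, x0=[0.5, 100.0], args=(tail, 125), method='Nelder-Mead')
aic_custom = 2 * len(res.x) + 2 * res.fun  # compare against the AICs in the table
```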

Hope that helps.

BTW, I think some of the statements above appear in the appendix of one of the papers listed in the readme (the first one, I believe), and that's probably why I forgot to include them in the readme itself. Sorry about that.

ttsesm commented 3 years ago

@XiangwenWang thanks for the feedback, it helps quite a lot; it makes the numbers and the output graphs much easier to understand. I will also have a look at the papers (thanks for the links). If I still have any questions, I will let you know.

XiangwenWang commented 3 years ago

You are welcome!