ttsesm opened this issue 3 years ago
Hi, I was trying your code related to heavy-tailed distributions. In my case I have a dataset that is heavy-tailed, so I ran a sample of it through the code here and got the following output:

[output table omitted]

and the following graphs:

[graphs omitted]

From what I understand, the pairwise power law seems to fit the input data best, but I am not sure I understand what that means. I would appreciate it if you could provide some feedback.

Thanks.
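(For reference, a run like the above can be reproduced with something along these lines; this is a minimal sketch assuming the `compare` entry point shown in the heavytailed README, with the file name as a placeholder:)

```python
# Minimal sketch of a comparison run with the heavytailed package.
# Assumes the `compare` helper described in the package README;
# 'sample.dat' is a placeholder for a one-column file of observations.
from heavytailed import compare

# Fit the candidate tail models and produce the AIC table and plots
compare('sample.dat', xmin=1)
```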
Hi there,
I haven't touched the code for roughly 2 years (as I'm no longer in academia), but the last time I ran it there were no such warnings. They are probably caused by NumPy/SciPy updates. Some distributions/models may not have reached their optimal fits as a result, but for now let's assume they are fine.
The purpose of this package is to examine the tail behavior of a dataset, in particular its distribution; in other words, to answer the question "among the candidate models, which one best describes the tail distribution of this dataset?" The general idea is to use statistical techniques to reach a conclusion. Here I relied on AIC for model selection, as suggested in "Power-law distributions in empirical data" (a well-cited paper by Clauset et al.).
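As a rough illustration of that AIC-based selection (a minimal sketch with scipy and synthetic data; the candidate set and model names here are illustrative, not the package's internal code):

```python
# Sketch: comparing candidate tail models by AIC = 2k - 2 ln L,
# where k is the number of fitted parameters and L the maximized likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.pareto.rvs(b=1.5, size=5000, random_state=rng)  # synthetic heavy-tailed sample

candidates = {
    'power law (Pareto)': stats.pareto,
    'lognormal': stats.lognorm,
    'exponential': stats.expon,
}

for name, dist in candidates.items():
    params = dist.fit(data)                      # maximum-likelihood fit
    loglik = np.sum(dist.logpdf(data, *params))  # ln L at the MLE
    k = len(params)                              # number of fitted parameters
    aic = 2 * k - 2 * loglik
    print(f'{name:>20s}  AIC = {aic:.1f}')
```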
For the output table, the first column shows where the tail starts; the second column is the model name; the third through the second-to-last columns are the fitted parameters of the corresponding model (a model may have several parameters). The last column is the AIC, and for model comparison that is the only column we need to focus on: the model/distribution with the smallest AIC is the best-fitting one. In this demo, the "Pairwise Power Law" model has the smallest AIC, 105206, so we would call it the best model (among the candidates) for describing the tail of the sample dataset.
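Concretely, the "smallest AIC wins" rule, plus Akaike weights as an optional way to express relative support, might look like this (the AIC values other than 105206 are made up for illustration, and the package itself does not necessarily report weights):

```python
import numpy as np

# Hypothetical (model, AIC) pairs read off such a table;
# only 105206 comes from the demo above, the rest are invented.
results = {
    'power law': 106512.3,
    'pairwise power law': 105206.0,
    'lognormal': 105890.7,
}

best = min(results, key=results.get)
print('best model by AIC:', best)

# Akaike weights: a standard way to turn AIC differences
# into relative support for each candidate model.
aics = np.array(list(results.values()))
delta = aics - aics.min()
weights = np.exp(-delta / 2) / np.exp(-delta / 2).sum()
for name, w in zip(results, weights):
    print(f'{name:>20s}  weight = {w:.3f}')
```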
But there is a crucial problem: what if none of the candidates is correct (or at least a good approximation)? That's why I additionally plot the fitted distributions against the data for verification. Usually histograms are drawn for this purpose, but since we focus on fat tails, histograms can be noisy and misleading. Instead, I plot the survival functions, 1 - CDF(x), which are smoother and clearer. From these plots we can confirm that the pairwise power law is the best model among the candidates, and that it provides a good approximation under visual inspection.
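A sketch of that verification plot, the empirical survival function 1 - CDF(x) on log-log axes with a fitted model overlaid (synthetic data; the package's own plots may differ in detail):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
data = stats.pareto.rvs(b=1.8, size=20000, random_state=rng)

# Empirical survival function: P(X > x_(i)) = (n - i) / n at sorted points
x = np.sort(data)
n = len(x)
sf_emp = 1.0 - np.arange(1, n + 1) / n

# Fitted model's survival function for comparison
params = stats.pareto.fit(data)
sf_fit = stats.pareto.sf(x, *params)

# Drop the last point, where the empirical survival function is exactly 0
plt.loglog(x[:-1], sf_emp[:-1], '.', ms=2, label='data (1 - CDF)')
plt.loglog(x, sf_fit, '-', label='fitted model')
plt.xlabel('x')
plt.ylabel('P(X > x)')
plt.legend()
plt.show()
```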
We could replace the visual inspection with statistical tests (such as the K-S test), but since I focused on analyzing empirical "big" data, which are influenced by many factors (resulting in complex behavior, such as small bumps in the distribution), the fits usually cannot pass such tests even when they look reasonable. So visual inspection is still the best option for the verification step.
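For completeness, such a test is a one-liner with scipy; note that fitting and testing on the same data also biases the p-value, which Clauset et al. address with a bootstrap procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = stats.pareto.rvs(b=1.5, size=100000, random_state=rng)

params = stats.pareto.fit(data)
# K-S test of the data against the fitted distribution.
# At large n, even tiny deviations become "significant",
# which is one reason big empirical datasets tend to fail it.
stat, pvalue = stats.kstest(data, 'pareto', args=params)
print(f'KS statistic = {stat:.4f}, p-value = {pvalue:.3f}')
```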
If none of the candidate models passes even a rough visual inspection, we probably want to add a customized model to the candidate set, as sketched below.
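Outside the package, "including a customized model" could look roughly like this: write the model's negative log-likelihood, fit it by maximum likelihood, and compute its AIC so it can join the same comparison. The power law with exponential cutoff here is just an example form, and the numerical normalization is a simplification:

```python
import numpy as np
from scipy import stats, optimize, integrate

rng = np.random.default_rng(3)
data = stats.pareto.rvs(b=1.5, size=2000, random_state=rng)  # toy tail sample, x >= 1
xmin = 1.0

def neg_loglik(theta):
    """Negative log-likelihood of a power law with exponential cutoff:
    p(x) proportional to x**(-alpha) * exp(-lam * x) for x >= xmin."""
    alpha, lam = theta
    if alpha <= 0.0 or lam <= 0.0:
        return np.inf
    # Normalizing constant, computed numerically for simplicity
    Z, _ = integrate.quad(lambda x: x**(-alpha) * np.exp(-lam * x), xmin, np.inf)
    return -np.sum(-alpha * np.log(data) - lam * data - np.log(Z))

res = optimize.minimize(neg_loglik, x0=[2.0, 0.1], method='Nelder-Mead')
alpha_hat, lam_hat = res.x
aic = 2 * 2 + 2 * res.fun  # AIC = 2k - 2 ln L, with k = 2 parameters here
print(f'alpha = {alpha_hat:.3f}, lambda = {lam_hat:.4f}, AIC = {aic:.1f}')
```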
Hope that helps.
BTW, I think some of the statements above appear in the appendix of one of the papers listed in the README (the first one, I believe), which is why I forgot to include them in the README itself. Sorry about that.
@XiangwenWang thanks for the feedback, it helps quite a lot; the numbers and the output graphs are much easier to understand now. I will have a look at the papers as well (thanks for the links). If I still have any questions I will let you know.
You are welcome!