csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.

inflated p values in GOF, algorithm not as specified by Clauset et al. #52

lsaravia closed this issue 9 years ago

lsaravia commented 9 years ago

I ran many GOF tests on different large data sets, and all of them gave p = 1, which seems very improbable, so I reviewed the GOF algorithm. I found that you generate the synthetic data for x < xmin from a uniform distribution instead of sampling from the original data (see the quote below). This could be the cause of the problem.

http://arxiv.org/pdf/0706.1062.pdf

Page 17:

"The generation of the synthetic data involves some subtleties. To obtain accurate estimates of p we need synthetic data that have a distribution similar to the empirical data below xmin but that follow the fitted power law above xmin. To generate such data we make use of a semiparametric approach. Suppose that our observed data set has ntail observations x ≥ xmin and n observations in total. We generate a new data set with n observations as follows. With probability ntail/n we generate a random number xi drawn from a power law with scaling parameter α̂ and x ≥ xmin. Otherwise, with probability 1 − ntail/n, we select one element uniformly at random from among the elements of the observed data set that have x < xmin and set xi equal to that element. Repeating the process for all i = 1, ..., n we generate a complete synthetic data set that indeed follows a power law above xmin but has the same (non-power-law) distribution as the observed data below."
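To make the quoted procedure concrete, here is a minimal Python sketch of the semiparametric step. It is only an illustration, not the poweRlaw code: the function name `synthetic_dataset` and its parameters are hypothetical, and it assumes the continuous power law, sampled by inverse transform as x = xmin · (1 − u)^(−1/(α̂ − 1)) with u uniform on [0, 1). The discrete case would need a different sampler.

```python
import random


def synthetic_dataset(data, xmin, alpha, rng=None):
    """Sketch of one semiparametric bootstrap sample (hypothetical helper).

    Assumes a continuous power law with alpha > 1 above xmin.
    """
    rng = rng or random.Random()
    # Observed values below the cutoff; these are resampled as-is.
    body = [x for x in data if x < xmin]
    n = len(data)
    n_tail = n - len(body)
    p_tail = n_tail / n

    sample = []
    for _ in range(n):
        if rng.random() < p_tail:
            # Tail draw: inverse-transform sample from the fitted power law.
            u = rng.random()  # u in [0, 1), so 1 - u is in (0, 1]
            sample.append(xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0)))
        else:
            # Body draw: pick an observed value below xmin uniformly at random,
            # rather than drawing from a uniform distribution on [min, xmin).
            sample.append(rng.choice(body))
    return sample
```

The `else` branch is the part this issue is about: values below xmin come from the observed data itself, so the synthetic sets keep the empirical shape of the body.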

csgillespie commented 9 years ago

That does sound ominous. At first glance, I think you are correct. I'll reread the paper, and fix ASAP.

csgillespie commented 9 years ago

I think I've fixed the issue (https://github.com/csgillespie/poweRlaw/blob/master/pkg/R/bootstrap_p.R). Any other comments welcome.

Many thanks again for the report and the solution.