csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.
109 stars 24 forks source link

Fitting a power-law with 0 #81

Closed phisanti closed 5 years ago

phisanti commented 5 years ago

I have been trying to fit a power-law distribution that contains 0 as a category. Looking at my data, I have many cases with no peaks and fewer and fewer cases with increasing peaks. If I try to fit the distribution on the data itself, I am forced to exclude the 0s as I get the message "Error in check_discrete_data(dat): Data should be strictly positive, i.e. no zeros." However, if I count the cases with a different function - see the table below - and input those counts for the fitting, the parameters that the model produces looks nothing like the "raw data" input.

Npeaks 
 1  3  0  0  0  0  2  1  5  3  0  0  5  3  4  3  1  1  1  7  4  5  1  3  4  0  3 0  5  3  1  1  5  3  0  0  4  0  6  0 1  2  1  1  1  2  4  0  0  2  4  1  1  0 0  1  2  2  9  3  0  0  8  0  0  0  0  2  4  0  6  2  0  1  8  0  5  4  2  0  0 1 2  1 14  0  5  2  3  1 10  1  7  2  3  2 13  8  6  0  9  3  1  5  3  1  1  0 0  2  0  3  5  1  1  1 11  0  0  0  0 1  0  7  1  2  0  2  0  0  0  4  0  0  0 2  0 10  0  1  4  3  0  2 11  0  1  2  0  6  0  0  0  6  0  2  0  4  1  2  2  0 0  1  1  0  1  1  0  1  0  1  0  3  3  2  0  0  0  2  0  1  2  0  0  0  1  0  0 0  7  0  0  1  0  0  3  0  0  1  0  0  2  0  0  1  1  1  0  0  0  0  2  0  0  0 1

x <- read.table("clipboard", sep = "" , header = F ,
                   na.strings ="", stringsAsFactors= F)
x <- as.vector(t(x)[,1])
estimate_xmin(displ$new(x[x>0])) # Case 1 - no 0, raw data
x <- table(x)
x
 0  1  2  3  4  5  6  7  8  9 10 11 13 14 
87 45 27 18 11  9  5  4  3  2  2  2  1  1 

estimate_xmin(displ$new(x)) # Case 2 - manual count, 0, I understand this should be different as now it includes the 0s
estimate_xmin(displ$new(x[2:length(x)]))  # Case 3 - manual count, no 0

I suppose case 1 and case 3 should look fairly similar, however, the results are completely different. Moreover, case 2 and 3 are more alike than 1 and 3. Hope you can help me to decide which approach is methodologically correct.

csgillespie commented 5 years ago

This

x <- table(x)
estimate_xmin(displ$new(x)) 

is wrong. What you are actually doing is

x <- table(x)
(y = as.vector(x))
length(y)
estimate_xmin(displ$new(y))

i.e. you are passing the frequencies as your data.


Regarding zeros, the discrete power-law isn't defined at 0. So either remove or remap your data, e.g.

estimate_xmin(displ$new(x[x>0])) 
estimate_xmin(displ$new(x + 1)) 

You now get very similar answers.