csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.
109 stars 24 forks source link

How do you calculate KS in estimate_xmin.R #51

Closed lsaravia closed 9 years ago

lsaravia commented 9 years ago

We were looking at your functions to understand how they work and noticed that in the continuous case the function get_KS_statistic calculates the KS distance between the estimated distribution and a uniform cdf instead of the empirical cdf, which seems not quite right.

csgillespie commented 9 years ago

I agree it doesn't seem quite right. However, a quick simulation:

library("poweRlaw")
x = sort(rplcon(10, 1, 2))
m = conpl$new(x)
m$setPars(2)
q = m$dat
n = m$internal[["n"]]; N = length(q)
q = q[(N-n+1):N]
fit_cdf = dist_cdf(m, q)
data_cdf = ((0:(n-1))/n)[1:length(fit_cdf)]
tmp = ecdf(x)
tmp(x)

shows that it does the correct thing. I got the idea from the R/Matlab code on Clausett's website.

One small issue I seem to recall is that if you have duplicates in the data the empirical cdf isn't quite correct, but then I suppose if the data is really CTN you shouldn't get duplicates.

lsaravia commented 9 years ago

I checked the plpva.r code and does the same, I am not sure why, I think that is the way you calculate dist_cdf, I you uncomment the plots you will see the differences with the empirical cumulative distribution. I calculated the distances using the builting ks.test function and checked, not quite wrong. But the problem I had is with GOF, I will open another issue for that.

comp_ks <-sapply(1:50, function(x){
    x <- sort(rplcon(50, 1, 2))
    m <- conpl$new(x)
    m$setPars(2)
    q <- m$dat
    n <- m$internal[["n"]]; N = length(q)
    q <- q[(N-n+1):N]
    fit_cdf <- dist_cdf(m, q)
    data_cdf <- ((0:(n-1))/n)[1:length(fit_cdf)]
    tmp <- ecdf(x)

    #plot(tmp,col=5,pch=21)
    #points(tmp(x),col=2)
    #points(fit_cdf,col=3)
    #curve(pplcon(x,m$xmin,est$pars),min(x),max(x),add = T,col=4)

    ks <-ks.test(x,"pplcon",1,2)$statistic

    pl_ks <-max(abs(fit_cdf-data_cdf))

    c(ks,pl_ks)
})

plot(comp_ks[1,],comp_ks[2,])
csgillespie commented 9 years ago

For info I'm away for the next 10 days.

csgillespie commented 9 years ago

The difference is due to the comparison. Spefically, are we comparing with Pr(X > x) or with Pr(X \ge x).

In your example above, if use subtract a very small amount from with using the ecdf we get the same result.

data_cdf
tmp(x)
tmp(x-.Machine$double.eps)