csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.

Issue fitting param values for discrete lognormal distribution #85

Open JaydenM-C opened 5 years ago

JaydenM-C commented 5 years ago

Dear Dr Gillespie,

Firstly, thanks for maintaining this excellent package. It's been invaluable for my current research.

I have a set of count values to which I'm trying to fit a discrete lognormal distribution. However, when I run bootstrap_p, it gives a vector memory exhaustion error (seems to be the same as issue #78).

counts <- c(54, 64, 126, 161, 162, 278, 281, 293, 296, 302, 322, 348, 418, 511, 696, 793, 1894)

dist <- dislnorm$new(counts)
dist$setXmin(estimate_xmin(dist))

bs <- bootstrap_p(dist)
Expected total run time for 100 sims, using 1 threads is 24.6 seconds.
Error in checkForRemoteErrors(val) : 
  one node produced an error: vector memory exhausted (limit reached?)

This seems to come from the rejection sampling algorithm: it generates N random numbers in order to obtain n values above xmin, but in my case the fitted parameter values are so low that N must be extremely large before anything exceeds xmin, which exhausts memory.
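To illustrate the failure mode (this is a sketch, not the package's internal code): if p is the probability of a draw landing above xmin under the fitted lognormal, rejection sampling needs roughly n / p draws on average. The meanlog and sdlog values below are hypothetical stand-ins for "extreme" fitted parameters:

```r
# Sketch: expected cost of rejection sampling n values above xmin
# when the fitted tail probability is tiny.
meanlog <- -10; sdlog <- 2   # hypothetical "extreme" fitted values
xmin    <- 54                # smallest count in the data
n       <- 17                # sample size

# Probability a single lognormal draw exceeds xmin
p_tail <- plnorm(xmin, meanlog, sdlog, lower.tail = FALSE)

# Average number of raw draws needed to get n values above xmin
expected_draws <- n / p_tail
```

With parameters like these, `p_tail` is on the order of 1e-12, so the expected number of draws runs into the trillions, which is exactly where the vector memory exhaustion comes from.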

Fitting the continuous lognormal distribution to the same data seems to produce more reasonable parameter estimates, bootstraps successfully and gives a p value around 0.8:

dist_cont <- conlnorm$new(counts)
dist_cont$setXmin(estimate_xmin(dist_cont))

bs <- bootstrap_p(dist_cont)

I'm wondering if there's something in the likelihood function for the discrete lognormal distribution (or elsewhere in the parameter fitting pipeline) that might be causing extremely low parameter values to be fitted to my data. I know this example has an extremely small sample size, but I'm curious, since fitting the continuous lognormal distribution to the same data seems to work fine.

Kind regards, Jayden.

PS: I'm well aware this could be a problem with me and my lack of understanding rather than the package (actually this is much more likely!) so I wrote about this a bit more on StackOverflow seeking more general advice.

csgillespie commented 5 years ago

So this is sort of a bug, in that the error message isn't very helpful. With a very large amount of memory the algorithm would eventually work ;) but that's not reasonable.

As you deduced, you're in the extreme tail of the lognormal, so my simple rejection sampler fails: the discrete fit places the data in the top 0.0000001% of the distribution.

The continuous (visually) gives the same fit, but more sensible parameter values.

My suspicion is that the likelihood surface is very flat, so many different parameter values give essentially the same fit.
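One way to check the flatness suspicion (a sketch, assuming the `counts` vector from the original report and the package's `dist_ll()`/`setPars()` interface) is to evaluate the discrete log-likelihood over a grid of parameter values:

```r
# Sketch: probe the discrete lognormal log-likelihood surface over a
# grid of (meanlog, sdlog) values; a flat surface shows up as many
# parameter pairs with nearly identical log-likelihood.
library("poweRlaw")

counts <- c(54, 64, 126, 161, 162, 278, 281, 293, 296, 302, 322, 348,
            418, 511, 696, 793, 1894)

m <- dislnorm$new(counts)
m$setXmin(estimate_xmin(m))

grid <- expand.grid(meanlog = seq(-10, 5, length.out = 20),
                    sdlog   = seq(0.5, 4, length.out = 20))

ll <- apply(grid, 1, function(p) {
  m$setPars(as.numeric(p))   # set candidate (meanlog, sdlog)
  dist_ll(m)                 # log-likelihood at those parameters
})
```

Plotting `ll` against the grid (e.g. with `filled.contour`) would make the flat region visible directly.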

To summarise, any fix I implement will just handle the error more gracefully; I can't think of a better way of simulating the random numbers.

JaydenM-C commented 5 years ago

Thanks so much for your explanation (both here and on StackOverflow). You've helped me understand what's going on much better. It's very much appreciated!

I agree with the proposed fix (to the extent it's a 'bug' that needs 'fixing' at all), though I certainly wouldn't expect it to be a high priority, given it only crops up in extreme cases like this, where the data is poor and the model doesn't work anyway. I'm a very satisfied user of the package, especially given your helpful comments.

Thanks again!