csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.

bootstrap_p with lognormal dist. returns checkForRemoteErrors(val), unexpectedly #78

Closed · yigit-hub closed this 5 years ago

yigit-hub commented 6 years ago

Dear Mr. Gillespie, I have an offline desktop at the statistical agency and use poweRlaw on a private data set, summarized below (NA values have been omitted):

| year | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | n |
|------|-----|---------|--------|------|---------|---------|---------|
| 2016 | 20 | 25 | 35 | 86 | 61 | 29,309 | 122,233 |
| 2015 | 20 | 26 | 37 | 93 | 66 | 27,852 | 110,963 |
| 2014 | 20 | 25 | 37 | 99 | 67 | 139,576 | 108,839 |
| 2013 | 20 | 25 | 35 | 83 | 58 | 70,679 | 118,974 |
| 2012 | 20 | 25 | 34 | 79 | 56 | 31,760 | 96,936 |
| 2011 | 20 | 25 | 34 | 86 | 56 | 83,700 | 86,855 |
| 2010 | 20 | 26 | 36 | 85 | 61 | 30,724 | 69,221 |
| 2009 | 20 | 30 | 44 | 112 | 82 | 101,386 | 48,162 |
| 2008 | 20 | 25 | 34 | 100 | 52 | 64,000 | 79,841 |
| 2007 | 20 | 27 | 38 | 95 | 66 | 43,555 | 56,537 |
| 2006 | 20 | 25 | 35 | 88 | 59 | 59,146 | 56,850 |
| 2005 | 20 | 25 | 34 | 93 | 59 | 40,393 | 58,218 |
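
Each row is essentially `summary()` output for one year's vector, along these lines (a sketch; `v2013` stands in for the 2013 vector):

```r
## Sketch: how one summary row above is produced
v2013 = na.omit(v2013)                   # NA values omitted, as noted
c(summary(v2013), n = length(v2013))     # Min, quartiles, Mean, Max, n
```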

My work environment:

Windows 7, i5, 4 GB RAM, R 3.5.1, RStudio 1.1.456, poweRlaw 0.70.1, VGAM 1.0-6, Rtools 3.5
(all newly downloaded and installed from file)

I fit the models to each of the 12 years' data as documented, and compare_distributions works fine for all of them.
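
For reference, that comparison step looks roughly like this (a sketch; `v2013` stands in for one year's vector, and the model names are my own):

```r
## Sketch of the per-year comparison (v2013 = one year's data vector)
library("poweRlaw")
m_pl = displ$new(v2013)               # discrete power law
m_pl$setXmin(estimate_xmin(m_pl))     # estimate the lower cutoff
m_ln = dislnorm$new(v2013)            # discrete lognormal
m_ln$setXmin(m_pl$getXmin())          # comparison requires a common xmin
m_ln$setPars(estimate_pars(m_ln))     # fit lognormal parameters at that xmin
comp = compare_distributions(m_pl, m_ln)
comp$p_two_sided
```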

Using bootstrap_p on the lognormal distribution, I get a p-value for some years, but the 2013, 2012, 2008, and 2005 data return the following error, with a different Gb value each time:

```r
library("poweRlaw")
m_ln = dislnorm$new(v2013)
est = estimate_xmin(m_ln)
m_ln$setXmin(est)

bs_p = bootstrap_p(m_ln, no_of_sims = 1, xmins = seq(140, 160, 2), threads = 4, seed = 1)
## ---time estimation message as usual here---
# Error in checkForRemoteErrors(val) :
#   four nodes produced an error: cannot allocate vector of size 405.6 Gb
```
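
A variation that might at least localize the failure (a sketch, not a confirmed fix; the `xmax` argument exists in recent poweRlaw versions):

```r
## Hedged sketch, not a confirmed fix: a single simulation on one thread, with
## an explicit cap on simulated values via xmax (available in recent poweRlaw
## versions). A single worker means only one large allocation at a time, which
## can make the failing step easier to pin down.
bs_p = bootstrap_p(m_ln, no_of_sims = 1, xmins = seq(140, 160, 2),
                   threads = 1, xmax = 1e5, seed = 1)
```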

This incredible vector size grows to hundreds of thousands of Gb depending on the inputs I give `bootstrap_p`, such as `no_of_sims` or `xmins`. This issue was raised here 3 years ago but was not solved. I have checked the following:

- All my data are separate vectors and I call them separately (no loops added by me).
- Object sizes for the data vectors are all in the hundreds of Kb and similar to each other (checked as in the sketch below).
- Strangely, only a few of the vectors return the error, and only for the lognormal distribution; all power-law p-values are calculated fine.
- I found a suggestion here under the Memory Load section but could not make use of it.
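
The size check mentioned above was along these lines (a sketch; `v2013` as before):

```r
## Sketch of the object-size check (v2013 = the 2013 data vector)
print(object.size(v2013), units = "Kb")   # a few hundred Kb
length(v2013)                             # 118,974 observations for 2013
```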

I'm sorry that I cannot provide the data for reproduction due to regulations. I hope this is still enough to get things working.

Thanks for this super package and your attention.

csgillespie commented 5 years ago

Sorry for the late response.

We've just updated how the lognormal is estimated. Perhaps try the latest version. Also, to reproduce this I'll need a copy of the data.
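
Since your machine is offline, updating would mean transferring the new package file over and installing from it, something like the following (a sketch; the file name is illustrative):

```r
## Install an updated poweRlaw from a local source tarball
## (file name is illustrative; use the version you transfer over)
install.packages("poweRlaw_0.70.2.tar.gz", repos = NULL, type = "source")
```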