csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.
109 stars 24 forks source link

data compression for likelihood #79

Closed maxbiostat closed 5 years ago

maxbiostat commented 5 years ago

I was wondering whether for the discrete distributions implementing a compressed likelihood wouldn't speed up estimation. The basic idea is to multiply the likelihood of x by the frequency of x to avoid computing the same thing over and over. I have an implementation over on my fork. I'm happy to submit a PR for you to review and fix it to conform to the packages standards. Here is some supporting info:

# install.packages("poweRlaw")
# devtools::install_github("maxbiostat/poweRlaw")
library(poweRlaw)
data(moby)
mm <- dislnorm$new(moby)

counter <- 0
maxOp <- 100
system.time(
  repeat {
    counter <-  counter + 1
    estimate_pars(mm)
    if (counter == maxOp){
      break
    }
  }
)

#### Original implementation 
# user  system elapsed 
# 51.611   0.074  51.730  
# > estimate_pars(mm)
# $pars
# [1] -3.582645  2.760856
# 
# $value
# [1] 40117.15
# 
# $counts
# function gradient 
# 18       18 
# 
# $convergence
# [1] 0
# 
# $message
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
# 
# attr(,"class")
# [1] "estimate_pars"

#### Compressed implementation
# user  system elapsed 
#  2.289   0.023   2.312 
# > estimate_pars(mm)
# $pars
# [1] -3.582645  2.760856
# 
# $value
# [1] 40117.15
# 
# $counts
# function gradient 
# 18       18 
# 
# $convergence
# [1] 0
# 
# $message
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
# 
# attr(,"class")
# [1] "estimate_pars"
csgillespie commented 5 years ago

Thanks. A PR would be great, with a couple of minor caveats:

maxbiostat commented 5 years ago

Squashing done. Please take your time to review this. Thanks for implementing these techniques in an open-source package. Least I can do is contribute a little bit. I'm still a bit puzzled why pulling the frequencies and values from internal wouldn't work; maybe I'm doing something wrong. I reckon you planned to have some sort of compression, and that's why internal includes freq and values.

csgillespie commented 5 years ago

@maxbiostat Thanks for the PR #80. I've merged and made a few changes.

One of my plans is for the discrete distributions to be able to handle much larger datasets. This means avoiding using the underlying data and only using freq/values, i.e. the data compression idea.

Happy for further PR on other discrete distributions. Thanks