data compression for likelihood

maxbiostat commented 5 years ago

I was wondering whether for the discrete distributions implementing a compressed likelihood wouldn't speed up estimation. The basic idea is to multiply the likelihood of x by the frequency of x to avoid computing the same thing over and over. I have an implementation over on my fork. I'm happy to submit a PR for you to review and fix it to conform to the packages standards. Here is some supporting info:

# install.packages("poweRlaw")
# devtools::install_github("maxbiostat/poweRlaw")
library(poweRlaw)
data(moby)
mm <- dislnorm$new(moby)

counter <- 0
maxOp <- 100
system.time(
  repeat {
    counter <-  counter + 1
    estimate_pars(mm)
    if (counter == maxOp){
      break
    }
  }
)

#### Original implementation 
# user  system elapsed 
# 51.611   0.074  51.730  
# > estimate_pars(mm)
# $pars
# [1] -3.582645  2.760856
# 
# $value
# [1] 40117.15
# 
# $counts
# function gradient 
# 18       18 
# 
# $convergence
# [1] 0
# 
# $message
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
# 
# attr(,"class")
# [1] "estimate_pars"

#### Compressed implementation
# user  system elapsed 
#  2.289   0.023   2.312 
# > estimate_pars(mm)
# $pars
# [1] -3.582645  2.760856
# 
# $value
# [1] 40117.15
# 
# $counts
# function gradient 
# 18       18 
# 
# $convergence
# [1] 0
# 
# $message
# [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
# 
# attr(,"class")
# [1] "estimate_pars"

csgillespie commented 5 years ago

Thanks. A PR would be great, with a couple of minor caveats:

I was sure I did this (but clearly I didn't). So perhaps I'm remembering somewhere else in the code. But that's up to me to find out.
I'm busy this week, but should be able to look at it next week?
Can you squash all commits in your repo into a single commit

maxbiostat commented 5 years ago

Squashing done. Please take your time to review this. Thanks for implementing these techniques in an open-source package. Least I can do is contribute a little bit. I'm still a bit puzzled why pulling the frequencies and values from internal wouldn't work; maybe I'm doing something wrong. I reckon you planned to have some sort of compression, and that's why internal includes freq and values.

csgillespie commented 5 years ago

@maxbiostat Thanks for the PR #80. I've merged and made a few changes.

One of my plans is for the discrete distributions to be able to handle much larger datasets. This means avoiding using the underlying data and only using freq/values, i.e. the data compression idea.

Happy for further PR on other discrete distributions. Thanks

csgillespie / poweRlaw

data compression for likelihood #79