RfastOfficial / Rfast

A collection of Rfast functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not have matrices or vectors with have missing data (i.e NAs). We do no check about them and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input, in order to get the correct output. We do no checks and this is one of the many reasons we are fast.
139 stars 19 forks source link

`dirimultinom.mle()` gives incorrect answers #53

Closed GabrielHoffman closed 2 years ago

GabrielHoffman commented 2 years ago

I have used Rfast in many applications. I would like to use dirimultinom.mle() to estimate the parameters from a Dirichlet Multinomial model. Other packages also provide this capability, but are very slow. I figured Rfast would provide a fast method, but it gives incorrect results.

I am interested in large scale analysis, but here is a minimal reproducible example showing the problem.

To Reproduce Here is a simple example to simulate counts from Dirichlet Multinomial. dirmult::dirmult estimates the alpha parameters correctly, but dirimultinom.mle does not.

library(HMP)
library(dirmult)
library(Rfast)

set.seed(1)

alpha = c(100, 5, 40)
n_samples = 5000
n_counts = 300
counts = Dirichlet.multinomial(rep(n_samples, n_counts), alpha)

# dirmult 
# gives accurate results
est = dirmult(counts)
est$gamma
#> [1] 102.161498   5.081866  41.066845

# Rfast
# parameter estimates are incorrect, algorithm stops after just 2 iteration
dirimultinom.mle(counts)
#> $iters
#> [1] 2
#> 
#> $loglik
#> [1] -1087840
#> 
#> $param
#> [1] 6979.6090  344.7844 2807.5673

Additional info dirimultinom.mle does work for some datasets, but when it fails it only uses 2 iterations. Looking at the code, it looks like the function uses Newton's method. I modified the code to use the true alpha values as a starting point, but the didn't fix the issue. Maybe the step size of the Newton update is should be reduced?

Desktop (please complete the following information):

statlink commented 2 years ago

Hi Gabriel,

Indeed you are right. This could be attributed to the step size of Newton-Raphson. I implemented the simple Newton-Rapshon algorithm, so that might be the case as it seems that the parameters tend to "explode". I need to check this more thorougly, but not for a while as I am busy right now. If you think you help with the NR algorithm, then be my guest.

GabrielHoffman commented 2 years ago

I have a fix where I use the log-likelihood and its gradient in optim() using L-BFGS-B. It converges quickly and works well. It is also dramatically faster than dirmult::dirmult().

I'm happy to contribute the code to Rfast if you are interested.

Cheers, Gabriel

statlink commented 2 years ago

My issue with optim is that it is drammatically slower than Newton-Raphson. In general we do not use the optim() function in Rfast. Is your gradient the same as mine? Can you please send it to my email? I might have to revisit my computations and or modify them.

statlink commented 2 years ago

Hi Gabriel.

I fixed it and it is really fast.

$iters [1] 14

$loglik [1] -1082198

$param [1] 102.161498 5.081866 41.066845

I have now added your name in the acknowledgements.