`dirimultinom.mle()` gives incorrect answers

GabrielHoffman commented 2 years ago

I have used Rfast in many applications. I would like to use dirimultinom.mle() to estimate the parameters from a Dirichlet Multinomial model. Other packages also provide this capability, but are very slow. I figured Rfast would provide a fast method, but it gives incorrect results.

I am interested in large scale analysis, but here is a minimal reproducible example showing the problem.

To Reproduce Here is a simple example to simulate counts from Dirichlet Multinomial. dirmult::dirmult estimates the alpha parameters correctly, but dirimultinom.mle does not.

library(HMP)
library(dirmult)
library(Rfast)

set.seed(1)

alpha = c(100, 5, 40)
n_samples = 5000
n_counts = 300
counts = Dirichlet.multinomial(rep(n_samples, n_counts), alpha)

# dirmult 
# gives accurate results
est = dirmult(counts)
est$gamma
#> [1] 102.161498   5.081866  41.066845

# Rfast
# parameter estimates are incorrect, algorithm stops after just 2 iteration
dirimultinom.mle(counts)
#> $iters
#> [1] 2
#> 
#> $loglik
#> [1] -1087840
#> 
#> $param
#> [1] 6979.6090  344.7844 2807.5673

Additional info dirimultinom.mle does work for some datasets, but when it fails it only uses 2 iterations. Looking at the code, it looks like the function uses Newton's method. I modified the code to use the true alpha values as a starting point, but the didn't fix the issue. Maybe the step size of the Newton update is should be reduced?

Desktop (please complete the following information):

x86_64-apple-darwin19.6.0
R 4.1.1
Rfast 2.0.4

statlink commented 2 years ago

Hi Gabriel,

Indeed you are right. This could be attributed to the step size of Newton-Raphson. I implemented the simple Newton-Rapshon algorithm, so that might be the case as it seems that the parameters tend to "explode". I need to check this more thorougly, but not for a while as I am busy right now. If you think you help with the NR algorithm, then be my guest.

GabrielHoffman commented 2 years ago

I have a fix where I use the log-likelihood and its gradient in optim() using L-BFGS-B. It converges quickly and works well. It is also dramatically faster than dirmult::dirmult().

I'm happy to contribute the code to Rfast if you are interested.

Cheers, Gabriel

statlink commented 2 years ago

My issue with optim is that it is drammatically slower than Newton-Raphson. In general we do not use the optim() function in Rfast. Is your gradient the same as mine? Can you please send it to my email? I might have to revisit my computations and or modify them.

statlink commented 2 years ago

Hi Gabriel.

I fixed it and it is really fast.

$iters [1] 14

$loglik [1] -1082198

$param [1] 102.161498 5.081866 41.066845

I have now added your name in the acknowledgements.

RfastOfficial / Rfast

`dirimultinom.mle()` gives incorrect answers #53