const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably

Inconsistent dispersion estimate between MLE and quasi-MLE #45

Closed anglixue closed 1 year ago

anglixue commented 1 year ago

Hi, I am trying to understand the difference between MLE and quasi-MLE results. So I did a simple simulation:

1000 genes x 500 cells, every gene has the same mean expression = 3 and dispersion = 5

# Simulate a matrix of 1000 genes x 500 cells; every gene follows the
# same negative binomial distribution
set.seed(1)  # for reproducibility
m = 1000     # number of genes
n = 500      # number of cells
mu = 3       # every gene has the same mean expression
size = 0.2   # NB size parameter, i.e. dispersion = 1/size = 5
data = matrix(NA, nrow = m, ncol = n)

for(i in 1:m){
  data[i,] = rnbinom(n = n, size = size, mu = mu)
}

fit = glmGamPoi::glm_gp(as.matrix(data), size_factors = FALSE, verbose = TRUE)
theta_mle = fit$overdispersions

m_est = rowMeans(data)

v_est = apply(data, 1, var)

theta_moe = (v_est-m_est)/(m_est^2)

res = data.frame(method = rep(c("moe", "mle", "ql_mle", "ql_mle_shrunken"), each = 1000),
                 m_est = rep(m_est, times = 4),  # repeat the whole vector once per method, not each element
                 theta_est = c(theta_moe, theta_mle,
                               fit$overdispersion_shrinkage_list$ql_disp_estimate,
                               fit$overdispersion_shrinkage_list$ql_disp_shrunken))

boxplot(theta_est~method, res)
abline(h = 1/size, lty = 3)
abline(h = 1, lty = 3)

However, when I checked the estimates from the four methods: moment (moe), MLE, ql_mle, and ql_mle_shrunken, I found that MOE and MLE provide relatively accurate estimates, but ql_mle (quasi-likelihood dispersion estimates based on the dispersion trend) and ql_mle_shrunken (shrunken quasi-likelihood dispersion estimates) are always close to 1 (see plot below).

[Screenshot 2023-02-07: boxplot of theta_est by method; moe and mle center near 5, ql_mle and ql_mle_shrunken near 1]

Then I further checked the fit results:

> summary(fit)
glmGamPoiFit object:
The data had 1000 rows and 500 columns.
A model with 1 coefficient was fitted.
The design formula is: Y~1

Beta:
            Min 1st Qu. Median 3rd Qu. Max
Intercept 0.673    1.02   1.09    1.16 1.4

deviance:
 Min 1st Qu. Median 3rd Qu. Max
 361     398    408     417 456

overdispersion:
  Min 1st Qu. Median 3rd Qu.  Max
 3.59    4.71   4.98    5.29 6.82

Shrunken quasi-likelihood overdispersion:
   Min 1st Qu. Median 3rd Qu.  Max
 0.738    0.95  0.997    1.06 1.33

size_factors: FALSE

Mu:
  Min 1st Qu. Median 3rd Qu.  Max
 1.96    2.78   2.99     3.2 4.04

I found the difference came from dispersion_trend and ql_disp_trend:

> summary(fit$overdispersion_shrinkage_list$dispersion_trend)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.885   4.944   4.979   4.978   4.999   5.152 
> summary(fit$overdispersion_shrinkage_list$ql_disp_trend)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.005655 0.787480 0.926023 0.934233 1.104975 1.186770 

I wonder if you have any ideas why the ql_disp_trend isn't consistent with the simulated parameter (theta = 5)? Thanks very much for your help!

const-ae commented 1 year ago

Hi Angli,

thanks for reaching out. I think the confusion stems from the definition of the quasi-likelihood dispersion estimates. They are defined as $disp_{ql} = (1 + \mu \cdot disp_{mle}) / (1 + \mu \cdot disp_{trend})$. The motivation for this equation is simply to equate the variance expressions of the two models: $var = \mu + disp_{mle} \, \mu^2$ and $var = disp_{ql} \, (\mu + disp_{trend} \, \mu^2)$.
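Plugging in your simulated values makes the behaviour concrete: when the MLE dispersion of a gene lies on the dispersion trend, the quasi-likelihood dispersion is exactly 1. A small sketch (the function name is just illustrative, not a glmGamPoi internal):

```r
# Quasi-likelihood dispersion as defined above:
# disp_ql = (1 + mu * disp_mle) / (1 + mu * disp_trend)
disp_ql <- function(mu, disp_mle, disp_trend) {
  (1 + mu * disp_mle) / (1 + mu * disp_trend)
}

# With the simulated parameters (mu = 3, true dispersion = 5) and a
# well-estimated trend, the ql dispersion collapses to 1:
disp_ql(mu = 3, disp_mle = 5, disp_trend = 5)  # = 1

# A gene whose MLE dispersion sits above the trend lands slightly above 1:
disp_ql(mu = 3, disp_mle = 6, disp_trend = 5)  # = 19/16 = 1.1875
```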

Thus I am not surprised that fit$overdispersion_shrinkage_list$ql_disp_trend varies around 1. For more information on the idea of the quasi-negative-binomial model and why shrinkage is easier there, take a look at Lund et al. (2012) and the help page for ?overdispersion_shrinkage.
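If you want an estimate on the original NB scale for comparison with your simulated theta = 5, the defining equation can be inverted. This is just a sketch following the variance identity above, not a glmGamPoi function:

```r
# Invert disp_ql = (1 + mu * disp_mle) / (1 + mu * disp_trend) to map a
# quasi-likelihood dispersion back to the NB (MLE) scale:
disp_nb_from_ql <- function(mu, disp_ql, disp_trend) {
  (disp_ql * (1 + mu * disp_trend) - 1) / mu
}

# A shrunken ql dispersion of 1 with a trend of 5 maps back to the
# simulated NB dispersion:
disp_nb_from_ql(mu = 3, disp_ql = 1, disp_trend = 5)  # = 5
```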

Best, Constantin