check_overdispersion underestimates dispersion in mixed models

Istalan commented 1 year ago

Hi,

so I was playing around with mixed Poisson models and noticed that for large numbers of random coefficients (also called random effects, i guess) the dispersion ratio from check_overdispersion was too low.

I used code from the Ben Bolker GLMM-FAQ (https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion) and modified it for 200 random coefficients:

library(lme4)
set.seed(101)  
d <- data.frame(x=runif(1000),
                f=factor(sample(1:200,size=1000,replace=TRUE))) # modified for more random effects
suppressMessages(d$y <- simulate(~x+(1|f), family=poisson,
                                 newdata=d,
                                 newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
#overdisp_fun(m1)
performance::check_overdispersion(m1)

Now this should be a perfect model, but the dispersion ratio is only about 0.8 not 1. Looking deeper the assumed residual degrees of freedom are 997, which is widely inconsitent with the result of 814

x <- performance::check_overdispersion(m1)
x$residual_df
pchisq(x$chisq_statistic, df = x$residual_df)

Now I assumed, that the issue is that overdispersion is a phenomenon on the conditional residuals, as the random coefficients have a scaling factor. So they act as fixed coefficients in this analysis and should be subtracted from the degrees of freedom which gives 200 RC the intercept and x makes 798 df. Checking with 200 simulations you have to modify Ben Bolkers code to avoid the issue with lambda < 5, but you get results that look pretty consistent

res <- replicate(200, {
  d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                  f=factor(sample(1:200,size=1000,replace=TRUE)))
  suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                   newdata=d,
                                   newparams=list(theta=1,beta=c(0,2)))[[1]])
  m1 <- glmer(y~x+(1|f),data=d, family=poisson)
  performance::check_overdispersion(m1)[[1]]
})
# doesn't fit df = 997
qqplot(res, rchisq(10000, 997))
abline(0, 1)
# does fit df = 798
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)

Now i wanted to make sure i got it right and used a much larger simulation and here the chisq seem to have a mean closer to 799 which confuses me, but is perhaps explained by remaining imprecision of the pearson residual:

library(parallel)
# 20,000 simulations on 10 cores, took 6 minutes on my machine
res_list <- mclapply(1:20000, function(i) {
    d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                    f=factor(sample(1:200,size=1000,replace=TRUE)))
    suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                     newdata=d,
                                     newparams=list(theta=1,beta=c(0,2)))[[1]])
    m1 <- glmer(y~x+(1|f),data=d, family=poisson)
    performance::check_overdispersion(m1)[[1]]
}, mc.cores = 10)   

res <- unlist(res_list)
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)

The same problem exists in the overdisp_fun from the GLMM-FAQ. Do you think I should post there as well?

Session info:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] performance_0.8.0 lme4_1.1-27.1     Matrix_1.2-18    

loaded via a namespace (and not attached):
 [1] minqa_1.2.4     MASS_7.3-51.6   compiler_3.6.3  tools_3.6.3     Rcpp_1.0.7      splines_3.6.3   nlme_3.1-147   
 [8] grid_3.6.3      insight_0.14.5  nloptr_1.2.2.3  boot_1.3-25     lattice_0.20-41

strengejacke commented 1 year ago

Tagging @bbolker as well...

bbolker commented 1 year ago

This is an interesting question. To be honest I haven't seen anyone discuss/derive this explicitly for GLMMs; as with so much of the rest of the GLMM literature, it's "this works for GLMs, I think the extension to GLMMs is correct". The simulation results are reasonably convincing, but: I would guess that the effective "residual degrees of freedom" computation is probably going to depend on the amount of shrinkage/size of the RE variances relative to the Poisson variance? (At the very least the GLMM FAQ needs a cautionary note: @Istalan, posting an issue there (replicate or point to this issue) would be great ...)

strengejacke commented 3 months ago

Use the new machinery from #643, the dispersion ratio looks good when we use simulated residuals (based on DHARMa):

library(performance)
library(lme4)
#> Loading required package: Matrix
set.seed(101)
d <- data.frame(
  x = runif(1000),
  f = factor(sample(1:200, size = 1000, replace = TRUE))
) # modified for more random effects
suppressMessages(d$y <- simulate(~ x + (1 | f),
  family = poisson,
  newdata = d,
  newparams = list(theta = 1, beta = c(0, 2))
)[[1]])
m1 <- glmer(y ~ x + (1 | f), data = d, family = poisson)

check_overdispersion(m1)
#> # Overdispersion test
#> 
#>        dispersion ratio =   0.817
#>   Pearson's Chi-Squared = 814.179
#>                 p-value =       1
#> No overdispersion detected.

check_overdispersion(simulate_residuals(m1))
#> # Overdispersion test
#> 
#>  dispersion ratio = 0.953
#>           p-value = 0.544
#> No overdispersion detected.

^{Created on 2024-03-16 with reprex v2.1.0}

This raises the question whether we should use that approach in general for mixed models? @bbolker

easystats / performance

check_overdispersion underestimates dispersion in mixed models #464