florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
214 stars 22 forks source link

Issue Interpreting residual patterns for a GLMER Binomial Logistic Regression Model #448

Open coanv opened 1 month ago

coanv commented 1 month ago

Hi,

I am using a multi-level "within/between" or "hybrid" model specification to predict the likelihood of holding credit card debt (0 = no, 1 = yes) as a function of caregiving status (0 = not caregiver, 1 = caregiver) and a slew of control predictors. I have 6 waves of panel data where observations are nested in people (grouping variable: HHIDPN). I've determined that a hybrid model specification (using group centered means and deviations for time variant predictors) is a better fit than a mixed model specification via an LR test using anova(). My hybrid model currently has multiple random effects (Random Intercept on Person ID (HHIDPN) and Random Slope on Year2)

This is the model specification I'm attempting to run diagnostics on using DHARMA mHF2<- glmer(haveccd ~ gender + coll + race + year2 + care2a.m + rmstatc.m+ employ.m+homeown.m + selfhlth.m +scale(rageyb.m) +hhhres.m +loghinc_real.m+ care2a.d+rmstatc.d+employ.d+homeown.d+selfhlth.d+scale(rageyb.d)+hhhres.d+ loghinc_real.d+(1|HHIDPN)+(year2-1|HHIDPN) ,data=ch1anly, family=binomial, control=glmerControl(optimizer="nlminbwrap"))

Following the instructions outlined here I have run the following code simulationOutput <- simulateResiduals(fittedModel = mHF2) plot(simulationOutput, asFactor = TRUE)

Which gives me these plots image

My understanding is I should be seeing a boxplot for the graph on the right given that I specified asFactor = TRUE and I'm not sure why that is not happening.

I then tried to test for possible zero-inflation given that the majority of my sample does not have credit card debt and is therefore coded as 0.

testZeroInflation(simulationOutput)

And I get this output

image

image

I think this is telling me that I do have zero inflation and need to adjust for that. But I'm not sure if I'm understanding the output correctly and if I am, what my next step should be. I'm hoping someone can help me understand why I am not getting a boxplot for my residuals despite specifying asFactor = TRUE, and also help me understand what this output means and what my next steps should be in trying to find the best fitting model.

melina-leite commented 1 month ago

Hi @coanv, Regarding the asFactor argument in function plotResiduals: It seems you have many data points (>10,000), and because of that, the scatterplot is replaced by graphics::smoothScatter(), independently if you are using asFactor= T. The argument works when you have many repeated values in the predictions of your model, which doesn't seem to be your case either. The example below shows that if you try to use asFactor = T to a DHARMa residual object with as many unique predictions as the number of observations, the function will create unique boxplots for each prediction:

data <- createData(sampleSize = 1000)
mod <- lme4::glmer(observedResponse ~Environment1 + (1|group), data=data,
             family=poisson())
res <- simulateResiduals(mod)
plotResiduals(res)
plotResiduals(res, asFactor=T)

nuniq = length(unique(res$fittedPredictedResponse))

If I had >10,000 data points the plot would be a smoothScatter, also ignoring asFactor=T

smoothScatter(x = res$fittedPredictedResponse, y = res$scaledResiduals,
                                     ylim = c(0,1), axes = FALSE,
                                     colramp = colorRampPalette(c("white", "darkgrey")))

smoothScatter(x = as.factor(res$fittedPredictedResponse), y = res$scaledResiduals,
              ylim = c(0,1), axes = FALSE,
              colramp = colorRampPalette(c("white", "darkgrey")))

Regarding the Zero Inflation test: Well, I'm not so sure about it, maybe @florianhartig can help here. First, I wondered if there would be such zero inflation in a Bernoulli distribution. I found this and this discussion online, but I also found papers like this one to puzzle me.

On the one hand, your model expects more zeros than the real data has. So, I was wondering if it could be the "one-inflated" case (? you could use testGeneric to test for one inflated see the help of the function). However, for the same reasons above, I don't know if such a problem exists in Bernoulli data.

On the other hand, given that you have such a huge dataset (>85,000?), the chance that a small departure from the expected zeros (or ones) will be significant is large. I'd then interpret the ratio value, which is pretty much close to 1. So, I'd not bother too much about it. However, take a look at the values of the KS and dispersion tests, too (not much the significance).

:)

florianhartig commented 1 month ago

Hi,

  1. Regarding asFactor: as Melina said, asFactor = T doesn't make sense if you don't have a single categorical predictor or at least very few predictors. However, the interaction of this option with smoothScatter is a bit odd - @melina-leite should we fix this so that asFactor also works for n>10.000?

  2. Regarding the plots: @coanv, the more sensible plot option in your case is to set quantreg = T, which forces the calculation of the quantile regression lines. This will take a while, but will make your plot much better interpretable. I would recommend that you do this for res ~ fitted and res ~ predictors

  3. Regarding the zeros: what the test shows you is not zero-inflation, but you have too few zeros. The tests also show some deviations from the distribution and the dispersion. It is uncommon (actually not really possible) to have these problems in 0/1 data, and effects are (in their size) really small, so I believe that this (i.e. the significant p-values) is created by the strong REs in your model in combination with your large sample size. Try to re-run the simulation with option re.form = NULL in the simulate residuals. This should create conditional simulations. If tests are fine then, there's nothing to worry.

Best F