florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
201 stars 21 forks source link

Interpretation of lme4 residual plots #360

Open florianhartig opened 1 year ago

florianhartig commented 1 year ago

Via email:

I have two models, and I created the diagnostic plots for them using the Dharma package and the simulateResiduals() function. There seem to be a couple of significant deviations on the plots that I am unsure how to address. The commands I run to plot are:

library(DHARMa) sims_sp <- simulateResiduals(model) png('model.png') plot(sims_sp,quantreg = T) dev.off()

Model 1: The response variable is a numeric continuous variable. I used a log-normal GLMM to model it (glmer() function, gaussian family with a log link), and my fixed effects and random effects are exactly like model 1: one categorical factor and one numerical fixed effect, and several categorical random effects. Here, when I attempt to plot the diagnostic plot, I get the following error: Unable to calculate quantile regression for quantile 0.25. Possibly to few (unique) data points / predictions. Will be ommited in plots and significance calculations. Unable to calculate quantile regression for quantile 0.5. Possibly to few (unique) data points / predictions. Will be ommited in plots and significance calculations. Unable to calculate quantile regression for quantile 0.75. Possibly to few (unique) data points / predictions. Will be ommited in plots and significance calculations.

image

Model 2: Similar to model 2, the response variable has continuous numerical values. I used a normal distribution and modeled this using lmer(). The fixed and random effects are exactly like model 2. With this one, the deviation in KS test is significant, but the lmer() function will not allow me to add a random effect with one level per observation. Additionally, I get the same error as model 1 when I plot it.

image

florianhartig commented 1 year ago

Hello,

the error messages originate from the qgam package, which I use to calculate the quantile regressions plus p-values for the residual distribution. I have no idea why you get them, because visually, it seems as if you have by far enough data points to fit these regressions, and I also don't see anything unusual about this distributions. Maybe you could update the qgam package?

In any case, it is clear that the res ~ predicted plots shows considerable problems for both cases, in m1 you have heteroskedasticity, and in m2 the distribution doesn't fit at all, so most likely your problem is mainly that you are dealing with a non-normal distribution here (from m1, it looks a bit as if it may have a upper bound?).

I'm not sure why you fit a log link - is there reasons to think that this is appropriate? I would probably rather play around with transformations of the response (i.e. fit log(y) ~ x) or any of the other strategies that are usually applied to deal with non-normal residuals.

Also, note that for lmer, you can and should use the standard lme4 residual plots, see here for some ideas about lmer diagnostic plots https://theoreticalecology.github.io/AdvancedRegressionModels/2C-RandomEffects.html#model-checks

Cheers, Florian

Saannah commented 1 year ago

Hi Florian, Thank you so much for your response!

As for heteroscedasticity, the typical recommendation is to transform the dependent variable but because one of my dependent variables is categorical, I am not sure how to deal with it. I have already normalized my other dependent variable (fixed effect) to mean of 0 and sd 1. I tried binning the numerical fixed effect and including it as a random effect (because it is not the primary effect I look to study). The result looked like this:

model 2

I relied on the messages displayed on the plot, and deduced that since the previous version without binning says no significant problems detected, it is probably preferred over binning where has problems detected in the model.

The reason I used a log normal distribution for m1 is because I used qqp() to look at how my data fits in different distributions and log normal seemed to be most fitting for m1. I did the same for m2 and this is why I used a normal distribution. Do you have other recommendations for picking a distribution?

Thank you for your help!

florianhartig commented 1 year ago

Hello Sana,

As for heteroscedasticity, the typical recommendation is to transform the dependent variable

this recommendation is a bit outdated - with glmmTMB, you can just model the dispersion via dispformula, so I would transform mainly with the goal of obtaining normality, the rest can be done with dispformula

my dependent variables is categorical

I'm a bit confused about this statement - earlier, you say e.g. that "Model 1: The response variable is a numeric continuous variable."

In general, I suspect this is not so much a problem of detailed residual checks, but possibly more a problem of choosing the appropriate GLM. Could you maybe clarify how your data really looks like, i.e mainly what is your response variable?

Saannah commented 1 year ago

Dear Florian,

I am sorry for the confusion I caused. My dependent variable (y) is continuous numeric with negative and positive values for model 2, and continuous numeric positive values for model 1. As for my independent variable/fixed effects (there was a typo in my previous comment where I wrote dependent instead of independent) one of them is a categorical factor and the other one of them is numerical and continuous positive values in both model 1 and model 2. I also have several categorical random effects for model 1 and model 2.

  1. Because I have a categorical x (fixed effect) I am having trouble finding a way to fix the heteroscedasticity.
  2. because I have negative values in my response variable (y) I cannot transform them using log() as you advised in your first response.
  3. I chose the distribution for my GLMM using qqp plots and testing different distributions. Do you have other recommendations for addressing this if you believe that it might be a general problem of picking an appropriate model?
  4. Does the diagnostic plot do a test for heteroscedasticity? Because the message says no significant issues detected. I am also confused about what the scatter plot should look in a non-problematic model.

thank you again and sorry for the inconvenience!

florianhartig commented 1 year ago

Hello Sana,

first point: you are addressing the questions in the wrong order. As the distribution affects the dispersion, it doesn't make sense to worry about heteroskedasticity before you have fixed the distribution.

Regarding your distribution: what we can say is that neither model seems to fit super well, and in particular model 2 looks very odd. I can't do a "distance diagnostic" without having your data, but it looks as if you have a bounded response? All I can say is look at standard GLM textbooks, which will tell you that you can modify:

to model your distribution. For strictly positive data, Gamma is common, and Beta for bounded responses. In most cases, thinking about the nature of your data will already go a long way.

Once you have fixed your distribution, you can fix the dispersion via dispformula in glmmTMB - this can be done for categorical and numeric predictors alike.