florianhartig opened this issue 3 years ago
OK, in terms of the question: this clearly doesn't look random, i.e. there seems to be a misfit / heteroskedasticity. What is odd, though, is the increasing pattern to the right - usually this happens only in super simple models that cannot fit the data, but here we have a GAM, so one wouldn't expect this problem to occur.
What I suspect is the problem: in the plot, you use the s(Plot, bs="re") trick - DHARMa will not recognise this as a real RE and will calculate model predictions including the s(Plot, bs="re") term. However, when residuals are plotted against model predictions that include the RE, this kind of pattern can occur, see https://github.com/florianhartig/DHARMa/issues/43
The solution would be to plot residuals against predictions from which the s(Plot, bs="re") term has been removed. If the residuals look fine then, I think the current plot should be disregarded.
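Something like the following should do it (a rough sketch - I'm assuming the fitted model object is called fit and the RE smooth is labelled "s(Plot)"; the label may differ in your model, see summary(fit)):

library(mgcv)
library(DHARMa)

# predictions on the link scale with the RE smooth set to zero
# ("s(Plot)" is an assumed label - exclude takes the smooth term label)
pred_no_re <- predict(fit, exclude = "s(Plot)", type = "link")

# plot the DHARMa residuals against these RE-free predictions
res <- simulateResiduals(fit)
plotResiduals(res, form = pred_no_re)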
I should add, though, that I wasn't able to re-create the problem described in #43 when simulating with a gam - possibly predict.gam excludes the REs?
library(DHARMa)
library(mgcv)

# simulate Poisson data with a group-level random effect
testData = createData(sampleSize = 200, randomEffectVariance = 1.5,
                      family = poisson(), numGroups = 20)

# fit the RE via a random effect smooth
fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = "poisson", data = testData)

simulateResiduals(fit, plot = TRUE)
Whatever the reason, it seems DHARMa performs fine for this model combination. Based on this, one would conclude that you really do have heteroskedasticity here - in this case, the best option would be to move towards a package that allows modelling the dispersion, e.g. glmmTMB if you are willing to give up the splines, or brms if not.
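For example, with glmmTMB the dispersion could be modelled via a dispersion formula - a minimal sketch (y, x and dat are placeholder names; Plot is the grouping variable from your model):

library(glmmTMB)

# negative binomial model in which the dispersion depends on a covariate
# (y, x and dat are placeholders, not taken from the original model)
fit_disp <- glmmTMB(y ~ x + (1 | Plot),
                    dispformula = ~ x,
                    family = nbinom2,
                    data = dat)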
For a GAM, any kind of pattern in the residual plots may be the result of choosing the basis dimension for a smooth term too low. In the original call to gam, k = 5 may be too low. mgcv::k.check() may be used for this (and actually should be used for every GAM before even looking at residuals). In this case, increasing k to 20 and refitting may help.
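A minimal sketch of that check (response y, smooth covariate x, data d and the family are placeholders; the RE smooth and k values as discussed above):

library(mgcv)

# fit with the original basis dimension and with a larger one
# (y, x and d are placeholders; the original model reportedly used k = 5)
fit_k5  <- gam(y ~ s(x, k = 5)  + s(Plot, bs = "re"), family = nb(), data = d)
fit_k20 <- gam(y ~ s(x, k = 20) + s(Plot, bs = "re"), family = nb(), data = d)

# k.check reports k', edf and a p-value; edf close to k' together with a low
# p-value suggests the basis dimension is too small
k.check(fit_k5)
k.check(fit_k20)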
However, it is true that there seems to be heteroscedasticity involved that may not be the result of a smooth term that is too restricted. There are some families to set up location-scale models in mgcv, and in this case one could try a Tweedie location-scale model (see ?mgcv::twlss). It's probably good to set gam(..., method = "efs") then, in order to use the extended Fellner-Schall method, which plays nicely with this kind of model.
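A minimal sketch of such a model (placeholder names again; the additional formulas model the remaining Tweedie parameters and are kept as intercepts only here - see ?mgcv::twlss for the exact setup):

library(mgcv)

# Tweedie location-scale model: twlss() is used with a list of linear
# predictors (mean plus intercept-only predictors for the remaining
# parameters), fitted with the extended Fellner-Schall method
fit_lss <- gam(list(y ~ s(x, k = 20) + s(Plot, bs = "re"),
                    ~ 1,
                    ~ 1),
               family = twlss(), method = "efs", data = d)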
Generally true, but in the case of such a diagonal pattern it seems hard to believe that this is a smoothing problem - one would think that it would be possible to improve this model with a linear term alone. To me, this looks much more like the RE problem described in #43, which essentially arises because the estimated RE group means are biased towards the RE mean (through the RE shrinkage).
Hi,
I also took a look at this, and while trying to match the original analysis more closely I found that the issue seemingly stems from theta being estimated as too small.
Also, I couldn't use DHARMa on the negbin gam models. Not sure if I'm missing something, but Pearson residuals should do fine.
In short, I simulate data with a theta of 10, then fit a model with nb(), which finds a theta of 2.2. Fitting one model with theta fixed at 2.2 and a "correct" one with the original 10, the first model shows the same trend, while the correct one doesn't, or at least not nearly as much.
library(DHARMa)
library(mgcv)
library(ggplot2)

# simulate negative binomial data with a known theta
orig_theta <- 10
testData = createData(sampleSize = 1000, fixedEffects = 0.5,
                      intercept = 2, # removes zeros at the left edge
                      randomEffectVariance = 1.5,
                      family = negbin(theta = orig_theta),
                      numGroups = 200)

# fit with theta estimated via nb()
fit_nb <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
              family = nb(),
              data = testData)

# compare estimated and simulated theta
fit_nb$family$getTheta() # >2 <2.5
orig_theta               # 10

# refit with theta fixed at the estimated value ...
fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = negbin(theta = fit_nb$family$getTheta()),
           data = testData)

# ... and with theta fixed at the "correct" simulated value
fit_correct <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
                   family = negbin(theta = orig_theta),
                   data = testData)

# Pearson residuals vs. linear predictor, fitted theta
df <- data.frame(pred = predict(fit, type = "link"),
                 res = residuals(fit, type = "pearson"))
ggplot(data = df, aes(x = pred)) + geom_point(aes(y = res)) +
  geom_smooth(aes(y = res, color = "mean")) +
  geom_smooth(aes(y = res^2, color = "variance")) +
  geom_hline(yintercept = 0, lty = 2, size = 2) +
  ggtitle("With fitted theta")

# Pearson residuals vs. linear predictor, original theta
df_c <- data.frame(pred = predict(fit_correct, type = "link"),
                   res = residuals(fit_correct, type = "pearson"))
ggplot(data = df_c, aes(x = pred)) + geom_point(aes(y = res)) +
  geom_smooth(aes(y = res, color = "mean")) +
  geom_smooth(aes(y = res^2, color = "variance")) +
  geom_hline(yintercept = 0, lty = 2, size = 2) +
  ggtitle("With original theta")
It does not appear to be caused by a small starting value for theta in nb():
fit3 <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
            family = nb(-12), # see the help: -12 -> 12 is used as a starting value
            data = testData)

# compare
fit3$family$getTheta() # same as before
orig_theta
"I should add, though, that I wasn't able to re-create the problem described in #43 when simulating with a gam - possibly predict.gam excludes the REs?"
That's interesting! I think predict.gam actually includes all smooth terms by default, so also the REs. This can be checked by using type = "lpmatrix" in that example (which gives the matrix that produces the linear predictor for each observation when multiplied with the coefficient vector):
library(DHARMa)
library(mgcv)

testData = createData(sampleSize = 200, randomEffectVariance = 1.5,
                      family = poisson(), numGroups = 20)

fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = "poisson", data = testData)

# Get the linear predictor matrix
lp <- predict(fit, type = "lpmatrix")

# The group (RE) terms are included as columns
lp[c(1, 11, 21), 1:5]
   (Intercept) Environment1 s(group).1 s(group).2 s(group).3
1            1 -0.859355396          1          0          0
11           1  0.336043251          0          1          0
21           1 -0.006868266          0          0          1
The predictions coming out of this are the same as those returned by default from predict.gam:
lp_pred <- lp %*% coef(fit)
all(lp_pred[,1] == predict(fit))
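For what it's worth, a sketch of how predictions without the RE contribution could be obtained here - either by zeroing the s(group) columns of the lpmatrix or, if I'm not mistaken, via the exclude argument of predict.gam; both should agree:

# zero out the columns belonging to the s(group) RE smooth
re_cols <- grepl("s(group)", colnames(lp), fixed = TRUE)
lp_no_re <- lp
lp_no_re[, re_cols] <- 0
pred_no_re <- lp_no_re %*% coef(fit)

# should match predict.gam with the RE term excluded
all.equal(as.numeric(pred_no_re), as.numeric(predict(fit, exclude = "s(group)")))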
Not sure, but could it be that for mgcv models DHARMa actually calculates conditional residuals and uses conditional predictions for the plot?
Hi all, @Istalan and @dschoenig, thanks for your comments, a few responses. First, fitting the equivalent model with glmmTMB:
library(glmmTMB)

fit_glmmTMB <- glmmTMB(observedResponse ~ Environment1 + (1 | group),
                       family = nbinom2,
                       data = testData)
This works perfectly fine. One would conclude that the s(group, bs="re") term must be under-penalised. Other than that, I can't really deduce much from your example; I don't think the fit through the Pearson residuals tells us much about the DHARMa residuals, as Pearson residuals can show weird patterns anyway.
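One way the penalisation of the RE term could be checked, at least roughly, is via the variance components implied by the smoothing parameters (a sketch, using fit_nb from the example above):

# standard deviation implied for the s(group, bs = "re") term; a value much
# larger than the simulated RE standard deviation would point towards
# under-penalisation
gam.vcomp(fit_nb)

# the raw smoothing parameters (small values mean little shrinkage)
fit_nb$sp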
About DHARMa: I can't get DHARMa working with either nb() or negbin() - does this work for you? I wonder how the user did this, or if these are really the correct nb fits. Looking at #11 suggests to me that this should never have worked, so I wonder how the plots that I received via email were produced.
@dschoenig, thanks for the comments. If the predictions are by default made conditional on the fitted REs, this may create the bias described. However, as I said, I didn't see it in the simulations. Possibly the reason the issue doesn't appear is that the REs are under-penalised (see the first point above): the patterns in #43 are created by the RE shrinkage, so if the REs in s(group, bs="re") effectively behave more like fixed effects, one would not expect the pattern to arise.
Overall, I'm not sure if we can make much progress here. I have just looked at an old issue which I never followed up on, #11.
My tendency would be to close this but link to #11, and to check the implementation again once the extended mgcv families are supported by DHARMa.
OK, I have solved at least one problem: @Istalan, I think fit_nb$family$getTheta() either just doesn't work or is the wrong way to access the fitted theta parameter - compare the difference:
fit_nb <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
              family = nb(),
              data = testData)

getSimulations(fit_nb)
simulate(fit_nb)
summary(fit_nb)
fit_nb$family$getTheta()
simulate / summary have the right theta values; the only issue is that simulate itself doesn't work.
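Possibly getTheta() simply returns theta on the log scale unless called with trans = TRUE - that would also fit the numbers above, since exp(2.3) is roughly 10. A quick check, assuming that is how the nb() family stores it:

# if theta is stored on the log scale, these two should differ by exp()
fit_nb$family$getTheta()      # presumably log(theta)
fit_nb$family$getTheta(TRUE)  # theta on the natural scale (trans = TRUE)
exp(fit_nb$family$getTheta()) # should then be close to the simulated theta of 10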
A user asks about these residual plots: [residual plots not shown]