florianhartig opened this issue 3 years ago
OK, in terms of the question: this clearly doesn't look random, i.e. there seems to be a misfit / heteroskedasticity. What is odd, though, is the increasing pattern to the right - usually this happens only in super simple models that cannot fit the data, but here we have a GAM, so one wouldn't expect this problem to occur.
What I suspect is the problem: in the plot, you use the s(Plot, bs="re") trick - DHARMa will not recognise this as a real RE and will calculate model predictions including the s(Plot, bs="re") term. However, when residuals are plotted against model predictions that include the RE, this kind of pattern can occur, see https://github.com/florianhartig/DHARMa/issues/43
The solution would be to plot residuals against predictions from which the s(Plot, bs="re") term has been removed. If the residuals look fine then, I think the current plot should be disregarded.
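Something like the following should do it (a rough sketch - I'm assuming the fitted model object is called fit and the RE smooth is labelled "s(Plot)"; the label may differ in your model, see summary(fit)):

library(mgcv)
library(DHARMa)

# predictions on the link scale with the RE smooth set to zero
# ("s(Plot)" is an assumed label - exclude takes the smooth term label)
pred_no_re <- predict(fit, exclude = "s(Plot)", type = "link")

# plot the DHARMa residuals against these RE-free predictions
res <- simulateResiduals(fit)
plotResiduals(res, form = pred_no_re)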
I should add, though, that I wasn't able to re-create the problem described in #43 when simulating with a gam - possibly predict.gam excludes the REs?
library(DHARMa)
library(mgcv)

# simulate Poisson data with a group-level random effect
testData = createData(sampleSize = 200, randomEffectVariance = 1.5,
                      family = poisson(), numGroups = 20)

# fit the RE via a random effect smooth
fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = "poisson", data = testData)

simulateResiduals(fit, plot = TRUE)
Whatever the reason, it seems DHARMa performs fine for this model combination. Based on this, one would conclude that you really do have heteroskedasticity here - in this case, the best option would be to move towards a package that allows modelling the dispersion, e.g. glmmTMB if you are willing to give up the splines, or brms if not.
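For example, with glmmTMB the dispersion could be modelled via a dispersion formula - a minimal sketch (y, x and dat are placeholder names; Plot is the grouping variable from your model):

library(glmmTMB)

# negative binomial model in which the dispersion depends on a covariate
# (y, x and dat are placeholders, not taken from the original model)
fit_disp <- glmmTMB(y ~ x + (1 | Plot),
                    dispformula = ~ x,
                    family = nbinom2,
                    data = dat)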
For a GAM, any kind of pattern in the residual plots may be the result of choosing the basis dimension for a smooth term too low. In the original call to gam, k = 5 may be too low. mgcv::k.check() may be used for this (and actually should be used for every GAM before even looking at residuals). In this case, increasing k to 20 and refitting may help.
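A minimal sketch of that check (response y, smooth covariate x, data d and the family are placeholders; the RE smooth and k values as discussed above):

library(mgcv)

# fit with the original basis dimension and with a larger one
# (y, x and d are placeholders; the original model reportedly used k = 5)
fit_k5  <- gam(y ~ s(x, k = 5)  + s(Plot, bs = "re"), family = nb(), data = d)
fit_k20 <- gam(y ~ s(x, k = 20) + s(Plot, bs = "re"), family = nb(), data = d)

# k.check reports k', edf and a p-value; edf close to k' together with a low
# p-value suggests the basis dimension is too small
k.check(fit_k5)
k.check(fit_k20)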
However, it is true that there seems to be heteroscedasticity involved that may not be the result of a smooth term that is too restricted. There are some families to set up location-scale models in mgcv, and in this case one could try a Tweedie location-scale model (see ?mgcv::twlss). It's probably good to set gam(..., method = "efs") then, in order to use the extended Fellner-Schall method, which plays nicely with this kind of model.
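A minimal sketch of such a model (placeholder names again; the additional formulas model the remaining Tweedie parameters and are kept as intercepts only here - see ?mgcv::twlss for the exact setup):

library(mgcv)

# Tweedie location-scale model: twlss() is used with a list of linear
# predictors (mean plus intercept-only predictors for the remaining
# parameters), fitted with the extended Fellner-Schall method
fit_lss <- gam(list(y ~ s(x, k = 20) + s(Plot, bs = "re"),
                    ~ 1,
                    ~ 1),
               family = twlss(), method = "efs", data = d)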
Generally true, but in the case of such a diagonal pattern it seems hard to believe that this is a smoothing problem - one would think that it would be possible to improve this model with a linear term alone. To me, this looks much more like the RE problem described in #43, which essentially arises because the estimated RE group means are biased towards the RE mean (through the RE shrinkage).
Hi,
I also took a look at this, and while trying to match the original analysis more closely I found that the issue seemingly stems from theta being estimated as too small.
Also, I couldn't use DHARMa on the negbin gam models. Not sure if I'm missing something, but Pearson residuals should do fine.
In short, I simulate data with a theta of 10, then fit a model with nb(), which finds a theta of 2.2. Fitting one model with theta fixed at 2.2 and a "correct" one with the original 10, the first model shows the same trend, while the correct one doesn't, or at least not nearly as much.
library(DHARMa)
library(mgcv)
library(ggplot2)

# simulate negative binomial data with a known theta
orig_theta <- 10
testData = createData(sampleSize = 1000, fixedEffects = 0.5,
                      intercept = 2, # removes zeros at the left edge
                      randomEffectVariance = 1.5,
                      family = negbin(theta = orig_theta),
                      numGroups = 200)

# fit with theta estimated via nb()
fit_nb <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
              family = nb(),
              data = testData)

# compare estimated and simulated theta
fit_nb$family$getTheta() # >2 <2.5
orig_theta               # 10

# refit with theta fixed at the estimated value ...
fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = negbin(theta = fit_nb$family$getTheta()),
           data = testData)

# ... and with theta fixed at the "correct" simulated value
fit_correct <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
                   family = negbin(theta = orig_theta),
                   data = testData)

# Pearson residuals vs. linear predictor, fitted theta
df <- data.frame(pred = predict(fit, type = "link"),
                 res = residuals(fit, type = "pearson"))
ggplot(data = df, aes(x = pred)) + geom_point(aes(y = res)) +
  geom_smooth(aes(y = res, color = "mean")) +
  geom_smooth(aes(y = res^2, color = "variance")) +
  geom_hline(yintercept = 0, lty = 2, size = 2) +
  ggtitle("With fitted theta")

# Pearson residuals vs. linear predictor, original theta
df_c <- data.frame(pred = predict(fit_correct, type = "link"),
                   res = residuals(fit_correct, type = "pearson"))
ggplot(data = df_c, aes(x = pred)) + geom_point(aes(y = res)) +
  geom_smooth(aes(y = res, color = "mean")) +
  geom_smooth(aes(y = res^2, color = "variance")) +
  geom_hline(yintercept = 0, lty = 2, size = 2) +
  ggtitle("With original theta")
It does not appear to be caused by a small starting value for theta in nb():
fit3 <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
            family = nb(-12), # see the help: -12 -> 12 is used as a starting value
            data = testData)

# compare
fit3$family$getTheta() # same as before
orig_theta
"I should add, though, that I wasn't able to re-create the problem described in #43 when simulating with a gam - possibly predict.gam excludes the REs?"
That's interesting! I think predict.gam actually includes all smooth terms by default, so also the REs. This can be checked by using type = "lpmatrix" in that example (which gives the matrix that produces the linear predictor for each observation when multiplied with the coefficient vector):
library(DHARMa)
library(mgcv)

testData = createData(sampleSize = 200, randomEffectVariance = 1.5,
                      family = poisson(), numGroups = 20)

fit <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
           family = "poisson", data = testData)

# Get the linear predictor matrix
lp <- predict(fit, type = "lpmatrix")

# The group (RE) terms are included as columns
lp[c(1, 11, 21), 1:5]
   (Intercept) Environment1 s(group).1 s(group).2 s(group).3
1            1 -0.859355396          1          0          0
11           1  0.336043251          0          1          0
21           1 -0.006868266          0          0          1
The predictions coming out of this are the same as those returned by default from predict.gam:
lp_pred <- lp %*% coef(fit)
all(lp_pred[,1] == predict(fit))
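For what it's worth, a sketch of how predictions without the RE contribution could be obtained here - either by zeroing the s(group) columns of the lpmatrix or, if I'm not mistaken, via the exclude argument of predict.gam; both should agree:

# zero out the columns belonging to the s(group) RE smooth
re_cols <- grepl("s(group)", colnames(lp), fixed = TRUE)
lp_no_re <- lp
lp_no_re[, re_cols] <- 0
pred_no_re <- lp_no_re %*% coef(fit)

# should match predict.gam with the RE term excluded
all.equal(as.numeric(pred_no_re), as.numeric(predict(fit, exclude = "s(group)")))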
Not sure, but could it be that for mgcv models DHARMa actually calculates conditional residuals and uses conditional predictions for the plot?
Hi all, @Istalan and @dschoenig, thanks for your comments, a few responses. First, fitting the equivalent model with glmmTMB:
library(glmmTMB)

fit_glmmTMB <- glmmTMB(observedResponse ~ Environment1 + (1 | group),
                       family = nbinom2,
                       data = testData)
This works perfectly fine. One would conclude that the s(group, bs="re") term must be under-penalised. Other than that, I can't really deduce much from your example; I don't think the fit through the Pearson residuals tells us much about the DHARMa residuals, as Pearson residuals can show weird patterns anyway.
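One way the penalisation of the RE term could be checked, at least roughly, is via the variance components implied by the smoothing parameters (a sketch, using fit_nb from the example above):

# standard deviation implied for the s(group, bs = "re") term; a value much
# larger than the simulated RE standard deviation would point towards
# under-penalisation
gam.vcomp(fit_nb)

# the raw smoothing parameters (small values mean little shrinkage)
fit_nb$sp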
About DHARMa: I can't get DHARMa working with either nb() or negbin() - does this work for you? I wonder how the user did this, or if these are really the correct nb fits. Looking at #11 suggests to me that this should never have worked, so I wonder how the plots that I received via email were produced.
@dschoenig, thanks for the comments. If the predictions are by default made conditional on the fitted REs, this may create the bias described. However, as I said, I didn't see it in the simulations. Possibly the reason the issue doesn't appear is that the REs are under-penalised (see the first point above): the patterns in #43 are created by the RE shrinkage, so if the REs in s(group, bs="re") effectively behave more like fixed effects, one would not expect the pattern to arise.
Overall, I'm not sure if we can make much progress here. I have just looked at an old issue which I never followed up on, #11.
My tendency would be to close this but link to #11, and to check the implementation again once the extended mgcv families are supported by DHARMa.
OK, I have solved at least one problem: @Istalan, I think fit_nb$family$getTheta() either just doesn't work or is the wrong way to access the fitted theta parameter - compare the difference:
fit_nb <- gam(observedResponse ~ Environment1 + s(group, bs = "re"),
              family = nb(),
              data = testData)

getSimulations(fit_nb)
simulate(fit_nb)
summary(fit_nb)
fit_nb$family$getTheta()
simulate / summary have the right theta values; the only issue is that simulate itself doesn't work.
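Possibly getTheta() simply returns theta on the log scale unless called with trans = TRUE - that would also fit the numbers above, since exp(2.3) is roughly 10. A quick check, assuming that is how the nb() family stores it:

# if theta is stored on the log scale, these two should differ by exp()
fit_nb$family$getTheta()      # presumably log(theta)
fit_nb$family$getTheta(TRUE)  # theta on the natural scale (trans = TRUE)
exp(fit_nb$family$getTheta()) # should then be close to the simulated theta of 10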
A user asks about these residual plots: [residual plots not shown]