florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
201 stars 21 forks source link

Residual pattern in Gamma GLM with glmmTMB #355

Closed nl101 closed 1 year ago

nl101 commented 1 year ago

I've constructed a gamma glmm solely for the purpose of doing pairwise comparisons of each group (Month2), but after building a model the residuals seem to still have a pattern (not sure if the magnitude is cause for concern or how to decide what is or what isn't). Does anything else stick out? Is dispersion a problem?

> table(env$CYR, env$Month2)

       Jan Feb Mar Apr Jul Aug Sep Oct Dec
  2008  21   0   0   0  46   0   0   0   0
  2009  47   0   0   0  47   0   0   0   0
  2011  47   0   0   0  47   0   0   0   0
  2012  47   0   0   0   0  46   0   0   0
  2013   0   0  47   0   0   0  47   0   0
  2014   0   0  31   8   0  30  12   0   0
  2015   0   0  30  17   0   0  47   0   0
  2016   0   0  24  23   0   0  47   0   0
  2017   0   0  47   0   0   0   0  47   0
  2018   0  20  27   0   0   0  47   0   0
  2019   0   0  47   0   0   0  47   0   0
  2020   0   0  47   0   0   0  47   0   0
  2021   0   0   0   0   0   0  47   0   0
  2022   0   9  38   0   0   0   0   0   0

> mod <- glmmTMB(temp ~ Month2 + time2 + (1|Site) + (1|CYR), family = Gamma(link = "log"), 
+                 ziformula = ~0, dispformula = ~1, data = env, REML = FALSE)

> summary(mod)
 Family: Gamma  ( log )
Formula:          temp ~ Month2 + time2 + (1 | Site) + (1 | CYR)
Data: env

     AIC      BIC   logLik deviance df.resid 
  4399.6   4460.5  -2187.8   4375.6     1169 

Random effects:

Conditional model:
 Groups Name        Variance  Std.Dev. 
 Site   (Intercept) 6.078e-12 2.465e-06
 CYR    (Intercept) 1.260e-03 3.550e-02
Number of obs: 1181, groups:  Site, 47; CYR, 14

Dispersion estimate for Gamma family (sigma^2): 0.00319 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 2.972415   0.017232  172.50  < 2e-16 ***
Month2Feb   0.144595   0.018664    7.75 9.40e-15 ***
Month2Mar   0.082474   0.014807    5.57 2.55e-08 ***
Month2Apr   0.064570   0.016746    3.86 0.000115 ***
Month2Jul   0.353848   0.007126   49.65  < 2e-16 ***
Month2Aug   0.326811   0.010893   30.00  < 2e-16 ***
Month2Sep   0.276007   0.014962   18.45  < 2e-16 ***
Month2Oct   0.211488   0.018779   11.26  < 2e-16 ***
time2       0.288161   0.022292   12.93  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ypred = predict(mod)
res = residuals(mod, type = 'pearson')
plot(ypred,res)

gamma resid

hist(res) distrib resid

sim_res <- DHARMa::simulateResiduals(mod, 250)
plot(sim_res,asFactor = F)

DHARMa resid nonpar disp

> DHARMa::testDispersion(mod)

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
dispersion = 0.89204, p-value = 0.344
alternative hypothesis: two.sided

> DHARMa::testDispersion(mod, type="PearsonChisq", 
+                        alternative="greater")

    Parametric dispersion test via mean Pearson-chisq statistic

data:  mod
dispersion = 0.0031286, df = 1169, p-value = 1
alternative hypothesis: greater

> emmeans(mod, pairwise ~ Month2)
$emmeans
 Month2 emmean     SE   df lower.CL upper.CL
 Jan      3.11 0.0143 1169     3.08     3.14
 Feb      3.25 0.0155 1169     3.22     3.28
 Mar      3.19 0.0108 1169     3.17     3.21
 Apr      3.17 0.0135 1169     3.15     3.20
 Jul      3.46 0.0153 1169     3.43     3.49
 Aug      3.43 0.0132 1169     3.41     3.46
 Sep      3.38 0.0109 1169     3.36     3.41
 Oct      3.32 0.0155 1169     3.29     3.35

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

$contrasts
 contrast  estimate      SE   df t.ratio p.value
 Jan - Feb  -0.1446 0.01866 1169  -7.747  <.0001
 Jan - Mar  -0.0825 0.01481 1169  -5.570  <.0001
 Jan - Apr  -0.0646 0.01675 1169  -3.856  0.0031
 Jan - Jul  -0.3538 0.00713 1169 -49.653  <.0001
 Jan - Aug  -0.3268 0.01089 1169 -30.001  <.0001
 Jan - Sep  -0.2760 0.01496 1169 -18.448  <.0001
 Jan - Oct  -0.2115 0.01878 1169 -11.262  <.0001
 Feb - Mar   0.0621 0.01194 1169   5.201  <.0001
 Feb - Apr   0.0800 0.01485 1169   5.389  <.0001
 Feb - Jul  -0.2093 0.01963 1169 -10.662  <.0001
 Feb - Aug  -0.1822 0.01656 1169 -11.003  <.0001
 Feb - Sep  -0.1314 0.01206 1169 -10.896  <.0001
 Feb - Oct  -0.0669 0.01667 1169  -4.014  0.0016
 Mar - Apr   0.0179 0.00955 1169   1.874  0.5692
 Mar - Jul  -0.2714 0.01604 1169 -16.922  <.0001
 Mar - Aug  -0.2443 0.01194 1169 -20.461  <.0001
 Mar - Sep  -0.1935 0.00486 1169 -39.849  <.0001
 Mar - Oct  -0.1290 0.01157 1169 -11.151  <.0001
 Apr - Jul  -0.2893 0.01784 1169 -16.216  <.0001
 Apr - Aug  -0.2622 0.01420 1169 -18.465  <.0001
 Apr - Sep  -0.2114 0.00938 1169 -22.550  <.0001
 Apr - Oct  -0.1469 0.01502 1169  -9.782  <.0001
 Jul - Aug   0.0270 0.01275 1169   2.120  0.4023
 Jul - Sep   0.0778 0.01617 1169   4.815  <.0001
 Jul - Oct   0.1424 0.01979 1169   7.195  <.0001
 Aug - Sep   0.0508 0.01219 1169   4.169  0.0009
 Aug - Oct   0.1153 0.01661 1169   6.944  <.0001
 Sep - Oct   0.0645 0.01253 1169   5.148  <.0001

Results are given on the log (not the response) scale. 
P value adjustment: tukey method for comparing a family of 8 estimates 

Subset of data:

> dput(env[1:10,])
structure(list(use_for_analysis = c("Standard", "Standard", "Standard", 
"Standard", "Standard", "Standard", "Standard", "Standard", "Standard", 
"Standard"), Date = structure(c(1489190400, 1519689600, 1551744000, 
1551571200, 1519776000, 1519689600, 1519603200, 1519689600, 1345507200, 
1551657600), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    CYR = c(2017L, 2018L, 2019L, 2019L, 2018L, 2018L, 2018L, 
    2018L, 2012L, 2019L), Season = structure(c(1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 2L, 1L), levels = c("DRY", "WET"), class = "factor"), 
    Month = c(3, 2, 3, 3, 2, 2, 2, 2, 8, 3), Time = structure(c(-2209049880, 
    -2209049100, -2209049280, -2209049280, -2209047600, -2209046760, 
    -2209046340, -2209044480, -2209019460, -2209047660), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), time2 = structure(c(0.293055555555556, 0.302083333333333, 
    0.3, 0.3, 0.319444444444444, 0.329166666666667, 0.334027777777778, 
    0.355555555555556, 0.645138888888889, 0.31875), format = "h:m:s", class = "times"), 
    DT = structure(c(1489233720, 1519733700, 1551787920, 1551615120, 
    1519821600, 1519736040, 1519650060, 1519738320, 1345577340, 
    1551703140), tzone = "", class = c("POSIXct", "POSIXt")), 
    Site = structure(c(40L, 46L, 27L, 47L, 37L, 45L, 47L, 44L, 
    1L, 37L), levels = c("1", "2", "3", "4", "5", "6", "7", "8", 
    "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", 
    "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", 
    "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", 
    "39", "40", "41", "42", "43", "44", "45", "46", "47"), class = "factor"), 
    temp = c(24.1, 24.7, 26.2, 25.8, 24.8, 24.2, 24.1, 25.7, 
    33.18, 27.5), sal = c(19.82, 33.36, 26.44, 34.98, 21.77, 
    32.4, 32.12, 33.58, 22.08, 30.59), DO = c(2.56, 2.7, 6.8, 
    5.6, 4, 4, 5, 4.5, 9.17, 4.1), water_depth = c(70, 45, 58, 
    110, 76, 75, 65, 65, 63, 69), sed_depth = c(55, 4, 76, 50, 
    47, 36, 25, 105, 127, 53), Month2 = structure(c(3L, 2L, 3L, 
    3L, 2L, 2L, 2L, 2L, 6L, 3L), levels = c("Jan", "Feb", "Mar", 
    "Apr", "Jul", "Aug", "Sep", "Oct", "Dec"), class = "factor")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
nl101 commented 1 year ago

Probably the wrong forum for this question - not a "problem" with DHARMa package.

florianhartig commented 1 year ago

Hello,

no its perfectly fine to ask questions about residuals here.

There is not indication of a dispersion problem - what sticks out to see is the bimodality for small predictions in the res ~ fitted plot

image

which could be an indication of zero-inflation, or else just an indication that there is a pattern of either higher or lower observed values that is not well-described by the current model.

nl101 commented 1 year ago

Hi Florian.

I'm modelling temperature here (all positive values). Thank you for the comment, it got me thinking this could be seasonality of temperature extremes (rare events like cold fronts and heat waves). Do you think this sort of pattern is an indication that the model is unreasonable to use for pairwise comparisons (it can't accurately predict what temperature will be so the comparisons won't be accurate)?

Which test for dispersion is correct to use here (parametric or non-parametric)? I did two not really knowing which is best. Stupid question -> Zero-inflation is as the name implies, right? I don't need to test for it if I don't have any zeros?

florianhartig commented 1 year ago

I'm modelling temperature here (all positive values). Thank you for the comment, it got me thinking this could be seasonality of temperature extremes (rare events like cold fronts and heat waves).

Why is temp alway > 0, is the Fahrenheit?

Hard to say what the reason is for the bimodality, you have to investigate

Do you think this sort of pattern is an indication that the model is unreasonable to use for pairwise comparisons (it can't accurately predict what temperature will be so the comparisons won't be accurate)?

No, it just means that the residuals are not Gamma distributed for small model predictions.

Which test for dispersion is correct to use here (parametric or non-parametric)? I did two not really knowing which is best.

This is not a dispersion problem, but if you want to test for dispersion, ideal is the DHARMa test on conditional residuals, which, however, is not available with glmmTMB, so I would just stick with the default.

Stupid question -> Zero-inflation is as the name implies, right? I don't need to test for it if I don't have any zeros?

Yes, it looks then what you have is a kind of "low value inflation", i.e. for low predictions, you have high and low values, but not so many observations in-between.

I'm just saying this by looking at the residuals, you should really look at your data!

nl101 commented 1 year ago

Why is temp alway > 0, is the Fahrenheit? Data comes from South FL, haha. Always Sunny :) (Degrees C)

Thank you for the help!

nl101 commented 1 year ago

Yes, it looks then what you have is a kind of "low value inflation", i.e. for low predictions, you have high and low values, but not so many observations in-between. This is really apparent here (Jan.-April=DRY, May-Dec.=WET). My hope was to compare all months, but there may be too much variability so comparisons within the same season might have to do. Seasonality

nl101 commented 1 year ago

I think a GAMM worked better:

Month2 = factor (Jan, Feb., etc..) mod <- gam(temp ~ Month2 + s(time2, bs = "cr") + s(Site, bs = "re") + s(CYR, bs = "re"), family = gaussian, data = env) DHARMa resid