florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Sensitivity / power of the dispersion test in DHARMa #117

Open florianhartig opened 5 years ago

florianhartig commented 5 years ago

There have been various questions about differences between results from DHARMa dispersion tests and other parametric overdispersion tests, such as this or performance::check_overdispersion. A typical scenario would be that the parametric test is positive and the non-parametric test (in DHARMa) isn't. Some comments about this:

What to do with the analysis: first of all, note that a non-significant test doesn't mean H0 is true (= no overdispersion), it just means that there is no proof that H0 is false (= overdispersion). Accordingly, when one test tells you "not sure" and the other tells you "sure", I would treat this as being sure that there is a problem. Given that it is usually not very difficult to move to a model that corrects for OD, I would always recommend this course of action in a practical data analysis.

Why is there a difference? The DHARMa tests are very general, because they are based on residual simulations only. It is completely reasonable that specific parametric tests such as the ones cited above are more sensitive in particular situations (= higher power). I had generally thought (and this is shown in the vignette with simulations) that power differences are not so large, but I have noted recently that the new DHARMa tests seem to have low power for detecting OD in some situations, in particular with large-variance REs. I will have to run additional tests to confirm this and possibly move back to the older (and more runtime-intensive) re-simulation tests.
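For concreteness, the comparison in question looks roughly like this (a minimal sketch; fittedModel is a placeholder for any fitted model supported by both packages):

library(DHARMa)
library(performance)

check_overdispersion(fittedModel)                # parametric overdispersion test
testDispersion(simulateResiduals(fittedModel))   # non-parametric, simulation-based DHARMa test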

ghost commented 4 years ago

Hi Florian,

I was wondering if you changed anything since 26 July 2019 to increase the power of DHARMa / make the outcome more reliable. If I run the DHARMa package on my Poisson glmer it says that I have no overdispersion, but when I run Ben Bolker's test this is my output:

        chisq         ratio           rdf             p
 1.794360e+03  7.974935e+00  2.250000e+02 2.471052e-242
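(For reference, the test I mean is, I believe, essentially the overdisp_fun rule of thumb from the glmmFAQ, roughly:)

overdisp_fun <- function(model) {
  # Pearson chi-squared statistic compared against a chi-squared distribution
  # with the residual degrees of freedom
  rdf <- df.residual(model)
  rp <- residuals(model, type = "pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq / rdf
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}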

Because of your post on the sensitivity/power of DHARMa and the result of Bolker's test I switched to negative binomial, but as this is a less powerful analysis I do not get any significant results. So I just wanted to check if anything changed since last year?

florianhartig commented 4 years ago

Hi michielvdglind,

if you're changing to negBin and you don't get a significant result any more, you must be scratching significance anyway, so this would make me hesitant to trust the Poisson result regardless of any dispersion test.

That being said - I have looked at this in some more detail, and one thing I noted is that the dispersion test seems to be more sensitive if lower-level REs are not re-simulated. I'm not sure which package you are using, but have you tried re-running simulateResiduals with the argument re.form = NULL, and then the dispersion test? Should work for lme4, and possibly for glmmTMB.
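A minimal sketch of what I mean (assuming a model fitted with lme4 and stored in fittedModel; re.form = NULL is passed through to simulate(), so the simulations condition on the fitted random effects rather than re-simulating them):

library(DHARMa)

simConditional <- simulateResiduals(fittedModel, re.form = NULL)  # conditional simulations
plot(simConditional)
testDispersion(simConditional)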

ghost commented 4 years ago

Hi Florian, thanks for the quick reply! I added the argument and indeed the QQ plot is strongly S-shaped and the dispersion, outlier and K.S. tests all show significant results now.

If I switch to NB without the argument re.form = NULL I still get a significant KS test of 0.046, but the other values and the shape of the QQ plot are much better. With the argument re.form = NULL the KS test is non-significant and the QQ plot fits even better.

Just to clarify, what do you mean by low-level residuals exactly, residuals with a low value? If I understand correctly, by adding re.form = NULL the simulations include the random effects of my model, whereas the default re.form = 0/NA excludes the random effects of my model?

Thanks again!

Istalan commented 4 years ago

Hi michielvdglind,

I've been spending some time double-checking the p-value calculations in this package, so I feel I'm qualified to somewhat answer your question.

Let's look at what Bolker himself says about the test: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#testing-for-overdispersioncomputing-overdispersion-factor

We can see that the test only works correctly under specific circumstances, like lambda > 5, which would mean linear predictors > log(5), or something like it. For negative linear predictors, one (or a few) unlikely events too many will have relatively large Pearson residuals, which changes the sum of squares massively. Especially so since your degrees of freedom are 225, which isn't very large.

DHARMa, using the simulation approach, will more or less identify the probability of such "distorting" events. Therefore the p-value is not significant, as such events seem to happen with some regularity.

Here's some code producing Pearson residuals for ideal Poisson models. You can see how "non-normal" the residuals for negative or even just small linear predictors are:

require(DHARMa)
require(ggplot2)

# collect linear predictors and Pearson residuals from a fitted glm
make.df.res <- function(model, res.type = "pearson"){
  df.res <- data.frame(model$linear.predictors, resid(model, type = res.type))
  colnames(df.res) <- c("linear_predictors", "res")
  df.res$weight <- model$weights
  return(df.res)
}

set.seed(816)
# simulate an ideal Poisson data set and fit the correctly specified model
testData <- createData(sampleSize = 500, fixedEffects = 2)
model <- glm(observedResponse ~ Environment1 + group, data = testData, family = "poisson")
df_res <- make.df.res(model)

# distribution of the linear predictors
ggplot(data = df_res, aes(x = linear_predictors)) +
  geom_histogram() +
  labs(title = "Histogram of the linear predictors of a simulated Poisson model")

# spline flexibility for the smoothers below: at most 10 degrees of freedom
k <- min(length(unique(df_res$linear_predictors)), 10)

# Pearson residuals against the linear predictor
ggplot(data = df_res, aes(x = linear_predictors, y = res)) +
  geom_hline(yintercept = 0) +
  geom_point(pch = 15, alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = k)) +
  labs(title = "Pearson residuals of the simulated Poisson model")

# squared Pearson residuals against the linear predictor, with the overall
# mean squared residual as a reference line
loc_var <- mean(df_res$res^2)

ggplot(data = df_res, aes(x = linear_predictors, y = res^2)) +
  geom_hline(aes(yintercept = loc_var)) +
  geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = k)) +
  coord_cartesian(ylim = c(0, loc_var * 3)) +
  labs(title = "Squared Pearson residuals of the simulated Poisson model")
Request: if you could do something similar for your model, I'm sure you would see a few unlikely events at the extremes of the linear predictor.
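Something along these lines might work for a glmer fit (an untested sketch; your_model is a placeholder, and since merMod objects don't expose $linear.predictors, predict() on the link scale is used instead):

# untested sketch; your_model is a placeholder for the fitted glmer model
df_res_user <- data.frame(
  linear_predictors = predict(your_model),     # link scale, including random effects
  res = resid(your_model, type = "pearson")
)
ggplot(df_res_user, aes(x = linear_predictors, y = res)) +
  geom_hline(yintercept = 0) +
  geom_point(pch = 15, alpha = 0.3) +
  labs(title = "Pearson residuals vs. linear predictor")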

Other than that, Florian is of course correct that the model's significance seems untrustworthy.

florianhartig commented 4 years ago

Hi Lukas, thanks for these comments, this is very helpful!

If I may summarise my current thinking about this issue, in the light of these comments:

michielvdglind, for your specific problem, this would suggest to me that:

  1. Given that DHARMa with re.form = NULL flags overdispersion, and given that so much happens with your p-values when you move to negBin, you probably have to accept that you have overdispersion

  2. Overdispersion can be a signal of model misspecification, so before moving to negBin, check your model structure (missing predictors, etc.), especially given that your results indicate that negBin might not be a perfect fit for that data either.

  3. If that doesn't help, move to negBin, or some other model for count data with variable dispersion (a rough sketch follows below this list). The fact that your KS test is significant is not a contradiction, the distribution could still deviate from a negBin, but if your residuals look better, the model is probably preferable. Note that zero-inflation and heteroskedasticity are problems that can always occur in Poisson GLMMs; zero-inflation could lead to a significant KS test.
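A rough sketch of this kind of follow-up (assuming the glmmTMB and DHARMa packages; response, predictors, group and yourData are placeholders for your own formula and data):

library(glmmTMB)
library(DHARMa)

# negative binomial GLMM (placeholder formula); add ziformula = ~1 to also allow zero-inflation
m_nb <- glmmTMB(response ~ predictors + (1 | group),
                family = nbinom2, data = yourData)

res_nb <- simulateResiduals(m_nb)
plot(res_nb)
testDispersion(res_nb)      # simulation-based dispersion test
testZeroInflation(res_nb)   # excess zeros could also produce a significant KS test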

Istalan commented 4 years ago

Hi Florian,

sorry for the confusion earlier. I did some simulations to check my intuitions. The code could be turned into functions, but whatever. (Sorry for the .txt file, GitHub discriminates against .R): Pearson_Pois_dispersiontest (copy).txt

To present the juiciest result:

#### lambda = 0.05, tails are way heavier ####

lam <- 0.05
n <- 200

# sum of squared Pearson residuals for n iid Poisson(lam) observations,
# replicated 10000 times
sum_r <- replicate(10000,
                   sum(((rpois(n, lambda = lam) - lam) / sqrt(lam))^2)
)
hist(sum_r)
qqplot(sum_r, rchisq(1000, df = n))  # compare against the chi-squared(n) reference

hist(sum_r, probability = T)
curve(dchisq(x, df = n), add = T, col = 2)  # chi-squared(n) density overlay

# same comparison for the dispersion ratio (sum of squares / df)
hist(sum_r / n, probability = T)
curve(n * dchisq(x * n, df = n), add = T, col = 2)
quantile(sum_r, 0.975) / n  # critical value simulation = 1.71
qchisq(0.975, df = n) / n   # critical value chi-distribution = 1.205289

I'd say that this is quite anti-conservative, and I think whoever wrote the bold all-caps APPROXIMATE into the FAQ probably agrees with me. michielvdglind's ratio of 8, however, is still too large, and lambdas considerably smaller than 0.05 shouldn't be looked at with only about 200 observations.

The p-value of the DHARMa test is just the proportion of more extreme simulations, which should always be fine for correct simulations and a sufficiently large number of them, because the empirical distribution function converges towards the true distribution function.

https://en.wikipedia.org/wiki/Empirical_distribution_function#Asymptotic_properties
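To make that concrete, the logic is roughly the following (a schematic sketch, not DHARMa's actual implementation):

# schematic sketch, not DHARMa's actual code: the p-value is the proportion of
# simulated dispersion statistics at least as extreme as the observed one,
# doubled for a two-sided test
sim_p_value <- function(observed_stat, simulated_stats) {
  p_lower <- mean(simulated_stats <= observed_stat)
  p_upper <- mean(simulated_stats >= observed_stat)
  min(1, 2 * min(p_lower, p_upper))
}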

Using the central limit theorem as a check: for 250 simulations and a two-sided test with p = 0.05 we get a 3-sigma confidence interval of 0.04 to 0.06. Here's the code:

p <- 0.025
sig <- p*(1-p)/sqrt(250)
2*(p + c(3, -3)*sig)

I'd call that pretty good. Is it better than the glmm variant? For lambdas smaller than 1, probably; for lambdas bigger than 5, no. In between is a bit of a grey zone, although the lambda = 1 case already looked a lot like a chi-squared distribution.

Regarding the effect of random effects I'm sadly out of my depth.

florianhartig commented 4 years ago

Hi Lukas, thanks for that, super helpful. I'm curious what @bbolker thinks about that?

bbolker commented 4 years ago

I don't have strong feelings. I do know that the "sum of squares of Pearson resids under the null ~ Chi^2(res df)" rule is approximate, I never meant it for more than a rough rule of thumb; happy for something more accurate. I know Venables and Ripley (MASS) make comments about the unreliability of the rule, McCullagh and Nelder probably have stuff to say about it too. I've certainly encountered surprising results (e.g. deviance vs Pearson chi^2 giving very different answers).

Figures and code from MASS:

[two screenshots of the relevant figures and code from MASS]

ghost commented 4 years ago

@florianhartig

I have not had the time to go through all the comments, nor am I entirely sure I would understand everything. But I thought the following might help to give some insight.

Not all the p-values are actually scratching significance in Poisson, some are very low:

 contrast              estimate     SE  df asymp.LCL asymp.UCL z.ratio p.value
 REF_SPACE - STRIP      -0.1612 0.0732 Inf   -0.3047   -0.0178 -2.204  0.0276 
 REF_SPACE - STRIP_VAR  -0.1472 0.0846 Inf   -0.3129    0.0186 -1.740  0.0818 
 REF_SPACE - STRIP_ADD   0.1273 0.0863 Inf   -0.0418    0.2964  1.476  0.1400 
 REF_SPACE - ROTATION   -0.2125 0.0890 Inf   -0.3869   -0.0381 -2.388  0.0169 
 STRIP - STRIP_VAR       0.0141 0.0428 Inf   -0.0699    0.0980  0.329  0.7423 
 STRIP - STRIP_ADD       0.2886 0.0461 Inf    0.1982    0.3789  6.258  <.0001 
 STRIP - ROTATION       -0.0512 0.0512 Inf   -0.1516    0.0492 -1.000  0.3172 
 STRIP_VAR - STRIP_ADD   0.2745 0.0463 Inf    0.1837    0.3652  5.928  <.0001 
 STRIP_VAR - ROTATION   -0.0653 0.0515 Inf   -0.1662    0.0356 -1.269  0.2044 
 STRIP_ADD - ROTATION   -0.3398 0.0542 Inf   -0.4461   -0.2335 -6.266  <.0001 

Whereas my negative binomial results are not even close to significant:

contrast               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 REF_SPACE - STRIP     -7.92e-02 0.203 Inf    -0.477     0.319 -0.390  0.6962 
 REF_SPACE - STRIP_VAR -1.12e-01 0.237 Inf    -0.576     0.351 -0.475  0.6349 
 REF_SPACE - STRIP_ADD -1.38e-05 0.237 Inf    -0.464     0.464  0.000  1.0000 
 REF_SPACE - ROTATION  -6.88e-02 0.249 Inf    -0.557     0.419 -0.276  0.7825 
 STRIP - STRIP_VAR     -3.31e-02 0.142 Inf    -0.311     0.245 -0.233  0.8157 
 STRIP - STRIP_ADD      7.92e-02 0.143 Inf    -0.201     0.360  0.553  0.5800 
 STRIP - ROTATION       1.05e-02 0.167 Inf    -0.316     0.337  0.063  0.9499 
 STRIP_VAR - STRIP_ADD  1.12e-01 0.146 Inf    -0.173     0.398  0.771  0.4409 
 STRIP_VAR - ROTATION   4.35e-02 0.172 Inf    -0.294     0.381  0.253  0.8003 
 STRIP_ADD - ROTATION  -6.87e-02 0.174 Inf    -0.410     0.272 -0.395  0.6926 

Here is also the summary output of my Poisson model:

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: poisson  ( log )
Formula: Total ~ Treatments + (1 | Sampling.date) + (1 | field)
   Data: mydata

     AIC      BIC   logLik deviance df.resid 
  2834.6   2858.7  -1410.3   2820.6      225 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.1552 -1.6571 -0.3199  1.0919 11.0337 

Random effects:
 Groups        Name        Variance Std.Dev.
 Sampling.date (Intercept) 1.2524   1.1191  
 field         (Intercept) 0.1952   0.4418  
Number of obs: 232, groups:  Sampling.date, 8; field, 6

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.34409    0.43584   5.378 7.52e-08 ***
Treatments.L  0.04314    0.06110   0.706 0.480199    
Treatments.Q  0.02585    0.04941   0.523 0.600855    
Treatments.C  0.24970    0.03589   6.958 3.46e-12 ***
Treatments^4  0.11472    0.03121   3.676 0.000237 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) Trtm.L Trtm.Q Trtm.C
Treatmnts.L  0.004                     
Treatmnts.Q -0.007 -0.513              
Treatmnts.C -0.006  0.407 -0.187       
Treatmnts^4 -0.001 -0.080  0.110  0.112


And a summary of my data:

Sampling.date        field       Treatments Ground.Beetles    Rove_beetle         Spider         Harvestmen       Ladybeetle    
 Min.   :0002-06-20   1 :56   REF_SPACE:39   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0015-02-03   2 :56   STRIP    :64   1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0019-03-06   3 :56   STRIP_VAR:48   Median :  3.00   Median : 1.000   Median : 2.000   Median :0.0000   Median :0.0000  
 Mean   :0018-05-01   4 : 9   STRIP_ADD:48   Mean   : 11.12   Mean   : 4.763   Mean   : 3.681   Mean   :0.3922   Mean   :0.1466  
 3rd Qu.:0023-10-12   5 :15   ROTATION :33   3rd Qu.: 10.25   3rd Qu.: 5.000   3rd Qu.: 4.250   3rd Qu.:0.2500   3rd Qu.:0.0000  
 Max.   :0030-07-20   10:40                  Max.   :165.00   Max.   :62.000   Max.   :24.000   Max.   :5.0000   Max.   :3.0000  
   Year           Month        Total      
 2018: 87   August   :29   Min.   :  0.0  
 2019:116   July     :87   1st Qu.:  3.0  
 2020: 29   June     :58   Median : 13.0  
            September:58   Mean   : 20.1  
                           3rd Qu.: 26.0  
                           Max.   :172.0  

The analysis is done using the model:

overPS <- glmer(Total ~ Treatments + (1 | Sampling.date) + (1 | field), data = mydata, family = poisson(link = "log"))

I was wondering where to find the lambda, to see if my data and model allow for Bolker's test. The mean of mydata$Total is 20.1, but I guess that this is not the lambda of the model that I built?
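I guess the per-observation lambdas would be the fitted means of the model rather than the raw mean of Total; something like this (not sure if this is right):

# my guess (not certain this is correct): the per-observation lambdas are the
# fitted means of the model, conditional on the random effects
lam_hat <- fitted(overPS)    # fitted means on the response scale
summary(lam_hat)
hist(lam_hat)
mean(lam_hat < 5)            # share of observations with an estimated lambda below 5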