florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Model comparison / overdispersion comparison via DHARMa residual tests #356

Open florianhartig opened 1 year ago

florianhartig commented 1 year ago

Via email:

I have a question about the interpretation of the p-values obtained from DHARMa tests, such as the overdispersion or uniformity tests.

Suppose I am running two models that differ in their error structure (m1 and m2) on the same dataset (constant sample size), and that the p-values extracted from testDispersion() are 0.2 for m1 and 0.9 for m2.
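For concreteness, the setup could look like the following sketch. The data, formulas, and model families are hypothetical placeholders (e.g. a Poisson vs. a negative binomial mixed model fit with glmmTMB), not part of the original question:

# hypothetical models differing only in their error structure
library(glmmTMB)
library(DHARMa)

m1 <- glmmTMB(count ~ treatment + (1 | site), family = poisson, data = dat)
m2 <- glmmTMB(count ~ treatment + (1 | site), family = nbinom2, data = dat)

# simulate scaled residuals and test for dispersion problems
res1 <- simulateResiduals(m1)
res2 <- simulateResiduals(m2)

testDispersion(res1)  # hypothetical p-value: 0.2
testDispersion(res2)  # hypothetical p-value: 0.9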

I am unclear on whether the hypothetical p-values of 0.2 for m1 and 0.9 for m2 indicate that:

i) m2's scaled residuals show less overdispersion than m1's, and, all other things being equal, m2 should be preferred,

or

ii) no inference about the degree of overdispersion in m1 and m2 can be made from the difference in their p-values; the p-values only indicate that, for both m1 and m2, the variance of the scaled residuals does not differ significantly from that of the simulated residuals.

Some parts of section 3) of the "General remarks on interpreting residual patterns and tests" in the DHARMa vignette make me think that ii) is correct. However, the fact that both models are run on the same dataset, and that p-values depend on effect strength and sample size, makes me wonder whether i) is correct.

florianhartig commented 1 year ago

p-values are not effect sizes, but I suppose you could conclude that if you were to estimate an unknown overdispersion parameter, m1 would likely have a larger value than m2, all other things equal.
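If you want a quantity closer to an effect size, the statistic returned by testDispersion() (roughly the ratio of observed to simulated residual variance) can be inspected directly. A sketch, reusing the hypothetical residual objects from above:

# the returned object is an htest-style object; $statistic is the observed vs
# simulated variance ratio reported by DHARMa, unlike the p-value in $p.value
d1 <- testDispersion(res1, plot = FALSE)
d2 <- testDispersion(res2, plot = FALSE)
d1$statistic  # values > 1 suggest overdispersion, < 1 underdispersion
d2$statistic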

The question, however, is: why are you interested in this parameter? If the goal is to compare the models m1 and m2 (as you seem to suggest), you should not use this value. The reason is that, in general, goodness-of-fit statistics such as R2, overdispersion, etc. are not suitable for comparing models, because more complex models generally fit better; in other words, they do not correct for model complexity. Thus, DHARMa GOF tests primarily tell you whether a model is compatible with the data. Based on your p-values, that is true for both m1 and m2, but the p-values should not be used to compare them!

For model selection, use tools like AIC or likelihood ratio tests. For mixed models, the problem is that the degrees of freedom are often not clear, so you have to be careful. DHARMa has a function for a simulated likelihood ratio test, https://rdrr.io/cran/DHARMa/man/simulateLRT.html, which circumvents this problem and also works for comparing models with different variance structures.
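A minimal sketch of how this could look for the two hypothetical models above (assuming the simpler model is passed first as the null model; see the linked help page for the exact arguments and defaults):

# simulated likelihood ratio test: m1 as the null model, m2 as the alternative;
# n is the number of simulations used to generate the null distribution
simulateLRT(m1, m2, n = 250)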

pv272 commented 1 year ago

Your answer makes me realize my question was nonsensical.

I first had a look at goodness of fit (using AIC), but then started investigating the dispersion and uniformity tests in isolation, out of curiosity. Along the way, I somehow started using the p-values from these tests for model selection (beyond significant p-values indicating that a model is not compatible with the data), instead of relying on goodness-of-fit measures.

If I understand correctly, the first sentence of your reply suggests that although p-values are not measures of effect size, they would reflect effect size ALL OTHER THINGS BEING EQUAL?