florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
200 stars 21 forks source link

How to reject a model with underdispersion but low AIC #422

Open nmueller18 opened 4 days ago

nmueller18 commented 4 days ago

First of all: Many thanks for the great package!

I am trying to model tooth loss among human populations before modern surgery (i. e. extraction at the dentist :-) ). I have a empirical data-set with individuals and tooth-loss as binary variable per tooth-possition (i.e. canine, molars etc.). The most important predicting variable is the age of the respective individual.

In the first run of models, and disregarding the information on tooth-position, the data-set is reduced to the number of lost teeth vs. the number of still observable tooth positions (which vary) per individual. A simple logistic regression (model 1) with binomial probability distribution reveals strong overdispersion:

fit <- glm(formula = cbind(lost_teeth, tooth_positions - lost_teeth) ~ age, family = binomial, data = data_tooth_loss)
AIC(fit)
2123.05
DHARMa::simulateResiduals(fittedModel = fit, plot = T)

logistic_simple

If the individuals are added (model 2), the AIC improves substantially but now the plot shows strong underdispersion. Strangely, the dispersion test is not significant.

fit <- glm(formula = cbind(lost_teeth, tooth_positions - lost_teeth) ~ age + individual_no, family = binomial, data = data_tooth_loss)
AIC(fit)
920.4828
simulationOutput <- DHARMa::simulateResiduals(fittedModel = fit, plot = T)
DHARMa::testDispersion(simulationOutput)

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
dispersion = 0.95687, p-value = 0.064
alternative hypothesis: two.sided

logistic_simple_ind

A betabinominal model (model 3) takes care of the overdispersion. I use the function betabin of the package aod for that. Despite the fact that DHARMa provides no built-in support for aod, thanks to the vignette, I was able to add a custom function for creating a DHARMa object. There is no overdispersion anymore and the plot of the residual looks fine in my view.

fit_beta <- aod::betabin(formula = cbind(lost_teeth, tooth_positions - lost_teeth) ~ age, ~ 1, data = data_tooth_loss, link = 'logit')
aod::AIC(fit_beta)
1159.566

simulatebetabin<- function(fitted, n, x){
  theta_param = fitted@random.param
  tp = unlist(x["tooth_positions"])
  pred = aod::predict(fit, type = "response")
  nObs = length(pred)
  sim = matrix(nrow = nObs, ncol = n)
  for(i in 1:n) sim[,i] = FlexReg::rBetaBin(nObs, size = tp, mu = pred, theta = theta_param)
  return(sim)
}
sim = simulatebetabin(fit_beta, 1000, data_tooth_loss)
DHARMaRes = DHARMa::createDHARMa(simulatedResponse = sim, observedResponse = data_tooth_loss$lost_teeth,
                                 fittedPredictedResponse = aod::predict(fit, type = "response"), integerResponse = T)
plot(DHARMaRes, quantreg = T)

betabin

I now face the problem that model 2 has the lowest AIC but that model 3 appears far more elegant and does not suffer from underdispersion. In the vignette and, for example, here you state that underdispersion is less of an issue than overdispersion so I am not sure on what ground I can reject model 2in favour of model 3. Model 2seems clearly overfitted from looking at the figures but is there a measure to read this result from, especially as the dispersion test is not significant?

Any advice is much appreciated!