florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Power of the KS test #409

Open florianhartig opened 2 months ago

florianhartig commented 2 months ago

via Email:

[...] the KS test, like the uniform-ecdf method, is rather underpowered, i.e. one concludes too late that there are deviations in the model. Since I still frequently use linear mixed models (LMMs), where goodness-of-fit is, as is well known, well covered by normality tests, I was wondering whether it would be possible for lmer objects (or lme) to also use the Shapiro-Wilk test on the untransformed data. This test is known to have the highest power and detects deviations quite quickly even for small sample sizes.

Attached is a small simulation and a few examples I have put together on this.

library(lme4)
library(DHARMa)
library(ddpcr)

# good case
###########

# n = 16 observations per group (N = 32)
n <- 16
obs <- 1:(2 * n)
group <- rep(c("A", "B"), each = n)
subject <- rep(1:(n / 2), 4)

# all normal
set.seed(11)
y <- 
  c(rnorm(n, 10, 1), rnorm(n, 5, 1)) + # residuals in groups normally distributed
  rep(rnorm(n / 2, 1, 1), 4) # random effect normally distributed
df <- data.frame(group, subject, y)

fit <- lmer(y ~ group + (1 | subject), data = df)
summary(fit) # all good

shapiro.test(resid(fit)) # non-significant
ks.test(resid(fit), "pnorm") # non-significant
simulationOutput <- simulateResiduals(fittedModel = fit, plot = T, use.u = T, n = 2000) # non-significant 

ks.test(residuals(simulationOutput), "punif") # as in simulateResiduals

# bad case
###########

# normality violated
set.seed(11)
y <- 
  c(rnorm(n, 10, 1), rnorm(n, 5, 1)) + 
  rep(rnorm(n / 2, 1, 1), 4) + 
  rt(n, df = 1) # heavy-tailed (Cauchy) noise term; note: length n, recycled over the 2 * n observations
df <- data.frame(group, subject, y)

fit2 <- lmer(y ~ group + (1 | subject), data = df)
summary(fit2) # wrong random effects

shapiro.test(resid(fit2)) # significant
ks.test(resid(fit2), "pnorm") # non-significant
simulationOutput <- simulateResiduals(fittedModel = fit2, plot = T, use.u = T, n = 2000) # non-significant

ks.test(residuals(simulationOutput), "punif") # as in simulateResiduals

# power simulation bad case
###########################

n <- 16
obs <- 1:(2 * n)
group <- rep(c("A", "B"), each = n)
subject <- rep(1:(n / 2), 4)

sw_norm <- rep(NA, 500)
any_norm <- rep(NA, 500)
ks_unif <- rep(NA, 500)
any_unif <- rep(NA, 500)

for(i in 1:500) {

  if(i %% 50 == 0) { print(i) }

  y <- 
    c(rnorm(n, 10, 1), rnorm(n, 5, 1)) + 
    rep(rnorm(n / 2, 1, 1), 4) + 
    rt(n, df = 1) # heavy-tailed (Cauchy) noise term; note: length n, recycled over the 2 * n observations
  df <- data.frame(group, subject, y)

  quiet(fit2 <- lmer(y ~ group + (1 | subject), data = df))

  # shapiro-wilk
  sw_norm[i] <- shapiro.test(resid(fit2))$p.value <= 0.05
  try(any_norm[i] <- shapiro.test(resid(fit2))$p.value <= 0.05 | shapiro.test(ranef(fit2)$subject[, 1])$p.value <= 0.05, silent = T)

  simulationOutput <- simulateResiduals(fittedModel = fit2, plot = F, use.u = T, n = 200)
  quiet(tR <- testResiduals(simulationOutput, plot = F))

  ks_unif[i] <- tR$uniformity$p.value <= 0.05
  any_unif[i] <- tR$uniformity$p.value <= 0.05 | tR$dispersion$p.value <= 0.05 | tR$outliers$p.value <= 0.05

}

mean(sw_norm)
mean(any_norm, na.rm = T)
mean(ks_unif)
mean(any_unif)
florianhartig commented 2 months ago

It's a good point that the KS test has lower power, although I would note that, in general, my view is that the tests are just an aid for interpreting the plots (see the comments in the vignette https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html#general-remarks-on-interperting-residual-patterns-and-tests) and should not be the main decision criterion for whether there is a problem. So one could argue about how important the loss of power through the KS test really is.

In any case, it's reasonable to choose the test with the highest power available, so let's say we are concerned about this and would like to switch to the Shapiro-Wilk test or some other parametric test to improve power:

1) First of all, note that DHARMa already offers the option to transform the residuals to any other distribution, including the normal, in the function DHARMa::residuals. So you could choose the normal distribution here and then use the Shapiro-Wilk test. There is an ongoing discussion about whether uniform or normal residuals are better in DHARMa and what the default should be; it's on our list to look at this in more detail: https://github.com/florianhartig/DHARMa/issues/39
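For illustration, a minimal sketch of that route, reusing the simulationOutput object from the example above (I'm assuming the quantileFunction / outlierValues arguments of residuals.DHARMa here; see ?residuals.DHARMa for the exact interface):

# transform the simulated residuals from the uniform to the normal scale;
# outlierValues replaces ecdf values of exactly 0 / 1, which qnorm would otherwise map to -Inf / Inf
resNorm <- residuals(simulationOutput, quantileFunction = qnorm, outlierValues = c(-7, 7))
shapiro.test(resNorm) # Shapiro-Wilk on the normal-scale DHARMa residuals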

2) One of the reasons I'm not doing the normal transformation by default is that, with the normal distribution, there is the problem of where to place points that fall outside the simulation envelope (and thus sit at an ecdf value of 0 or 1) after back-transformation. However, note that in the case of lmer there would actually be no need to simulate at all: you could get the ecdf position analytically (and then back-transform to normal, which, however, would be the same as just using the residuals from lmer). The reason I always simulate in DHARMa is a purely practical one: I just don't want to program a large set of rules for when to use simulation-based and when to use analytical ecdf values. So, long story short: if you want analytical values with higher power, just use the default residuals. I have also considered whether it would be useful to provide an option to read such default residuals into DHARMa, so that users could still use the plots / tests. In that case, you would provide the residuals and the lmer object, and DHARMa wouldn't simulate but rather use the default residuals() function, while the plots would still be available: https://github.com/florianhartig/DHARMa/issues/408
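To make the analytical route concrete, here is a rough sketch for one of the lmer fits above (e.g. fit), using only base R and the resid() / sigma() accessors; this conditions on the fitted random effects, analogous to use.u = TRUE, and is not a DHARMa feature:

# under the fitted Gaussian model, the ecdf position of each conditional residual
# is available in closed form, so no simulation is needed
uAnalytical <- pnorm(resid(fit), mean = 0, sd = sigma(fit))
ks.test(uAnalytical, "punif") # uniformity test on the analytical quantile residuals
# back-transforming to the normal scale just recovers the (scaled) lmer residuals,
# so the Shapiro-Wilk test here is equivalent to shapiro.test(resid(fit))
shapiro.test(qnorm(uAnalytical))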

3) In case we stay with uniform residuals: a question one could look into is whether there is a test for uniformity with power comparable to that of the Shapiro-Wilk test for normality. I have to say that I never researched this and just chose the KS test as the first convenient option that came to mind.
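As a rough starting point for such a comparison (my own sketch, not something DHARMa currently does), one could swap the KS test for a more powerful EDF-based uniformity test, e.g. the Anderson-Darling test from the goftest package, or map the uniform residuals to the normal scale and apply the Shapiro-Wilk test there:

u <- residuals(simulationOutput) # DHARMa residuals on the uniform scale
goftest::ad.test(u, "punif") # Anderson-Darling test for uniformity (requires the goftest package)
uClamped <- pmin(pmax(u, 1e-6), 1 - 1e-6) # avoid qnorm(0) / qnorm(1) = +/- Inf
shapiro.test(qnorm(uClamped)) # Shapiro-Wilk after transforming to the normal scale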

Cheers

Florian