florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Why uniformity in y direction if we plot against any predictor? #365

Closed: tqdo closed this issue 1 year ago

tqdo commented 1 year ago

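I am reading the package's vignette (this section), which explains how to interpret the residuals. If the model is correctly specified, we look for two things in the residuals:

  1. a uniform (flat) distribution of the overall residuals
  2. uniformity in y direction if we plot against any predictor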

I understand the 1st point, but I can't wrap my head around why the 2nd point is true. I would really appreciate any help.

florianhartig commented 1 year ago

The key to this is to understand that each single residual is essentially a p-value, and thus uniformly distributed under H0.

If we expect a uniform distribution for EACH residual, we also expect that

  1. ALL residuals are uniformly distributed
  2. Any SUBSET of residuals is uniformly distributed

This is essentially what the standard DHARMa plots test. The left plot shows the joint distribution of all residuals; the right plot shows the residuals ordered against a predictor. If we group the residuals according to a particular value of the predictor, they should still be uniform, which is what the second statement you cite refers to.
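The argument above can be illustrated with a small base-R sketch. It does not call DHARMa; it only mimics the idea of a simulation-based quantile residual, assuming a correctly specified Poisson GLM. The data-generating process, the variable names, and the subset cutoff `x > 0.7` are made up for illustration.

```r
set.seed(1)
n   <- 1000
x   <- runif(n)
y   <- rpois(n, lambda = exp(0.5 + x))   # assumed data-generating process
fit <- glm(y ~ x, family = poisson)       # correctly specified model (H0 holds)

# Simulate new responses from the fitted model.
# The matrix is filled column-wise, so row i always uses the fitted mean of observation i.
n_sim <- 250
sims  <- matrix(rpois(n * n_sim, lambda = fitted(fit)), nrow = n)

# Quantile residual: where does each observation fall within its own simulated
# distribution? Ties (integer data) are broken uniformly at random.
res <- rowMeans(sims < y) + runif(n) * rowMeans(sims == y)

# Uniformity of ALL residuals
ks_all <- ks.test(res, "punif")

# Uniformity of a SUBSET, selected by the value of the predictor
subset_hi <- x > 0.7
ks_subset <- ks.test(res[subset_hi], "punif")

ks_all$p.value
ks_subset$p.value
```

Under these assumptions, both the full set and the subset should look approximately uniform, so neither test is expected to flag a problem in most runs.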


tqdo commented 1 year ago

thanks

tqdo commented 1 year ago

Another related question if you don't mind:

My understanding from your answer is that if the model is not specified correctly, the residuals will not follow a uniform distribution. I ran an experiment in which I intentionally omitted, during training, a feature that was used to generate y. What I observed: the residuals are very non-uniform when plotted against the missing feature, but appear almost uniform when plotted against a random, unrelated feature. This makes intuitive sense to me (the plot suggests that the missing feature can help explain the response, while the random unrelated feature has no value), but I don't understand why the residuals would appear uniform for the unrelated random feature.

Code in R and plots

```r
set.seed(666)
library(DHARMa)

x1 = rnorm(1000)
x2 = rnorm(1000)
z = 1 + 2*x1 + 3*x2        # linear predictor uses both features
pr = 1/(1+exp(-z))
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1, x2 = x2)

# Omit x2 during training
fittedModel = glm(y ~ x1, data = df, family = "binomial")
simulationOutput <- simulateResiduals(fittedModel = fittedModel, plot = F)

# Strong deviations from uniformity
plotResiduals(simulationOutput, x2)
# (Screenshot 2023-01-24 at 4 31 49 PM)

# Minimal deviations from uniformity
plotResiduals(simulationOutput, runif(1000))
# (Screenshot 2023-01-24 at 4 31 58 PM)
```

florianhartig commented 1 year ago

What I state is an implication of H0: H0 => i.i.d. uniform residuals. It does not follow that !H0 => non-uniform residuals. Uniform residuals are therefore no guarantee that the model is correct, but if you see non-uniformity, you know that something is wrong. This is why there are so many different plots / tests.

All this is, however, the same for all residual checks. In an OLS regression, you can also have a perfect QQ plot and still see a pattern when you plot the residuals against a predictor.
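A minimal sketch of the OLS case, under assumptions chosen for illustration: the response depends on two normal predictors, and the model omits one of them. The residuals are then the sum of two normal terms, so they are normal overall, yet they trend strongly with the omitted predictor. All names and coefficients here are hypothetical.

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)   # assumed data-generating process

fit <- lm(y ~ x1)              # x2 is omitted
r   <- resid(fit)

# Overall, the residuals are x2 + noise, which is normal:
# the sample skewness is close to 0 and a QQ plot would look fine.
skew <- mean((r - mean(r))^3) / sd(r)^3

# Against the omitted predictor there is a clear linear trend ...
cor_omitted <- cor(r, x2)

# ... while OLS residuals are uncorrelated with the included predictor by construction.
cor_included <- cor(r, x1)

c(skew = skew, cor_omitted = cor_omitted, cor_included = cor_included)
```

Visually, `qqnorm(r)` and `plot(x2, r)` should tell the same story: the first looks unremarkable, the second shows the pattern.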

So, what you are doing with the residual checks is to perform a number of sanity checks on your model, but that doesn't guarantee that it is correct.
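The same point can be made numerically for the logistic example from the question. The sketch below is a base-R stand-in, not DHARMa's implementation: it hand-rolls simulation-based quantile residuals and compares the mean residual between the lower and upper half of a variable. The helper `gap` and the variable `noise` are hypothetical names introduced for illustration.

```r
set.seed(666)
n     <- 1000
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- runif(n)                         # unrelated to y
y     <- rbinom(n, 1, plogis(1 + 2 * x1 + 3 * x2))

fit <- glm(y ~ x1, family = binomial)     # x2 omitted on purpose
p   <- fitted(fit)

# Simulation-based randomized quantile residuals.
# The matrix is filled column-wise, so row i always uses the fitted probability p[i].
n_sim <- 250
sims  <- matrix(rbinom(n * n_sim, 1, p), nrow = n)
res   <- rowMeans(sims < y) + runif(n) * rowMeans(sims == y)

# Difference in mean residual between the upper and lower half of a variable.
# For uniform residuals that are unrelated to v, this should be close to 0.
gap <- function(v) mean(res[v > median(v)]) - mean(res[v <= median(v)])

gap_x2    <- gap(x2)      # omitted predictor: residuals shift with x2
gap_noise <- gap(noise)   # unrelated variable: no visible shift

c(gap_x2 = gap_x2, gap_noise = gap_noise)
```

The model is wrong in both comparisons, but only the check against `x2` has any power to reveal it, because the misspecification carries no information about `noise`.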

florianhartig commented 1 year ago

See also the section on interpreting residuals in the vignette https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html#interpreting-residuals-and-recognizing-misspecification-problems