easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0
967 stars 87 forks source link

Improving check_model() for GLMs #376

Open bwiernik opened 2 years ago

bwiernik commented 2 years ago

The current selection of plots returned by check_model() for GLMs aren't ideal in a few ways.

1. They are missing a linearity check (fitted vs residuals). For binomial models, this should be a called to binned_residuals(). For other families, the standard check is fine. 2. For binomial models, the constant variance plot should be omitted.

  1. For binomial models, the residual QQ plot is hard to interpret. 4. For non-bernoulli models, we should include a plot for checking overdispersion.

For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches. https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html https://github.com/florianhartig/DHARMa/issues/33

bbolker commented 2 years ago

One standard check for bernoulli models is to do a binned binomial proportions plot (this is, loosely, a graphical equivalent of the Hosmer-Lemeshow test, and is implemented in arm::binnedplot()

bwiernik commented 2 years ago

Yep, that's implemented in performance as binned_residuals()

bbolker commented 2 years ago

ooops, I missed your comment about binned_residuals(). D'oh!

bwiernik commented 2 years ago

No worries! Thanks for your comment!

IndrajeetPatil commented 2 years ago

To be fair, though, it is kind of difficult to know the existence of this visual check for two reasons:

library(performance)

model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")

x <- check_model(model)
names(x)
#> [1] "VIF"         "QQ"          "HOMOGENEITY" "OUTLIERS"    "INFLUENTIAL"
print(binned_residuals(model))
#> Warning: Probably bad model fit. Only about 50% of the residuals are inside the error bounds.

plot(binned_residuals(model))
#> Warning: Probably bad model fit. Only about 50% of the residuals are inside the error bounds.

Created on 2021-12-09 by the reprex package (v2.0.1)

DominiqueMakowski commented 2 years ago

Should we incorporate the plot into check_model?

bwiernik commented 2 years ago

Yes, like I said in the first post, I think we need to change most of the plots in check_model() for GLMs.

bwiernik commented 2 years ago

I think we also might want to add a check_predictions() plot to check_model() for all models, perhaps dropping the normal density plot?

bbolker commented 2 years ago

This isn't the place, but I may have some Opinions about binned_model (I would like a smoothed line added, binomial CIs on the points; I would rather have binomial CIs on the individual points and a confidence interval on the smooth rather than have a dichotomizing region that indicates 'good' and 'bad' points ...). I see that this is copying what arm::binnedplot does though ...

bwiernik commented 2 years ago

I like that. Have an example?

bbolker commented 2 years ago

Not yet but could do one tomorrow.

bwiernik commented 2 years ago

That would be great!

jebyrnes commented 2 years ago

Would it be easier to pull in https://github.com/florianhartig/DHARMa as its very flexible in terms of model type?

bbolker commented 2 years ago

https://gist.github.com/bbolker/945e684dc148239364ab0f99c9a1beed

I realized that the residuals from binomials are not binomial (so constructing binomial CIs isn't useful); however I think that a binned plot using mean_cl_boot() to generate CIs is probably fine. See examples and decide what you think is useful or not ...

bwiernik commented 2 years ago

I think it would be good to put check_predictions() (pp_check) into the this plot array. For linear models, we have two normality checking plots. I think it would be good to swap out the density plot for a posterior predictions plot. Thoughts?

strengejacke commented 2 years ago

For binomial models, the constant variance plot should be omitted.

Which panel do you exactly mean? I think internally, the names are different from plot titles. I assume you mean the plot in the top right panel?

image

bwiernik commented 2 years ago

Yes, the top right one.

strengejacke commented 2 years ago

Current state:

library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
check_model(model)

Created on 2022-02-25 by the reprex package (v2.0.1)

strengejacke commented 2 years ago

I think we also might want to add a check_predictions() plot to check_model() for all models, perhaps dropping the normal density plot?

Only the predictions, not the range, right? (i.e. not what would be returned when check_predictions(model, check_range = TRUE), because that would be too much for such little space?)

strengejacke commented 2 years ago

@strengejacke TODO: use same colours for binned residuals as in other plots (e.g. blue/red like in VIF)

bwiernik commented 2 years ago

Only the predictions, not the range, right?

Yeah I agree.

I would say put predictions first, followed by linearity, homogeneity, outliers, normality, then multicollinearity

I really don't like multicollinearity checks at all, but can understand it would controversial to remove this plot from the array 😜

jebyrnes commented 2 years ago

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

strengejacke commented 2 years ago

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

Do you mean for binomial models?

jebyrnes commented 2 years ago

Yes? Or any glm. Also meant quantile not quartile. Stupid autocorrect. Ha!

strengejacke commented 2 years ago
library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
check_model(model)

m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)

Created on 2022-02-25 by the reprex package (v2.0.1)

strengejacke commented 2 years ago

@bwiernik should we change the labs for the pp-check panel? First, y and yrep don't sound clear to me. Second, if colours are changed, lines are probably no longer blue and grey. And there's no y and x axis title, leading to empty spaces (which is, however, aligned with the other plots).

bwiernik commented 2 years ago

Yeah, I can clarify the labels to match the other plots style

strengejacke commented 2 years ago

For non-bernoulli models, we should include a plot for checking overdispersion.

We don't have such a plot right now...

strengejacke commented 2 years ago

(comment moved to https://github.com/easystats/performance/issues/382)

bwiernik commented 2 years ago

We don't have such a plot right now...

I've seen 3 types of plots for diagnosing over dispersion. One plots conditional mean against conditional variance directly:

image

Another uses the typical scale-location plot. For this, we could just change the title of the homogeneity plot when it is used for a relevant GLM.

image

The other is DHARMa, which uses simulations to make qq plots and predicted vs residual plots:

image

Thoughts?

bwiernik commented 2 years ago

Thoughts on the above 3 plots? @strengejacke @DominiqueMakowski @IndrajeetPatil @mattansb @vincentarelbundock

strengejacke commented 2 years ago

For me, the first looks most intuitive

jebyrnes commented 2 years ago

I like both the first as well as the QQ plot. The residual v. predicted with 3 lines has never really made sense to me.

bbolker commented 2 years ago

You didn't ask me, but:

  1. I think this kind of plot is most useful for diagnosing/eye-balling the variance-mean relationship (e.g. do the raw/un-pearsonized residuals most resemble Poisson-type (var ~ mean), NB1 (var = c*mean, c>1), NB2 (var = mu*(1+mu/k), k>0), Gamma/log-normal (var = c*mean^2), ... ? It would be even better if it were on a log-log scale so we could compare the power law of the mean-variance scaling to various reference lines (i.e. var prop to mean, prop to mean^2, etc.).
  2. (a) The left-hand plot here puzzles me; not sure what's on the axes? (b) I actually like this one (right-hand plot) best (in perfect contrast to @jebyrnes), if the multiple lines are explained. In my experience people usually use the scale-location plot only to detect trends in scale, but there's no reason it couldn't be calibrated to set the expected scale/y-axis value to 1 ...
  3. I don't think the Q-Q plot is typically useful/used for diagnosing overdispersion. DHARMa does a test for overdispersion (I don't remember how) and reports it on the Q-Q panel, but normally the Q-Q plot is only used for agreement in shape (i.e., we only look to see if the pattern is linear, rather than comparing the slope to an expected theoretical slope — this is in parallel with the point above that scale-location plots are usually used to compare pattern, not magnitude ...

(This discussion reminds me of the Cullen and Frey skew-kurtosis graph illustrated here (fitdistrplus::descdist()), which I find interesting but not actually very useful ...

bwiernik commented 2 years ago

The qqplots can be useful for overdispersion because that shape with fat tails is a characteristic pattern of overdispersion. That said, I'm really not a fan of qqplots in general, and I agree that other plots like posterior predictive densities and fitted-residuals plots are probably more useful

mattansb commented 2 years ago

I also like the first one (:

I don't think that (most people) can diagnose dispersion by looking a a qq-plot, but maybe that's just me (:

strengejacke commented 2 years ago

@bwiernik How can we create the first plot variance versus expected)? We may include this in the forthcoming update.

bwiernik commented 2 years ago

tl;dr I think we should go with the second plot below. What do you think?

Okay, so I looked into what SAS was doing with that first plot. Super disappointing. It's actually just plotting the model-predicted mean agains the model-predicted variance (so for a Poisson model, always a straight line). So there's no actual observed data there, only predictions. I wouldn't even call it a diagnostic plot -- you need to have already fit a model with over dispersion before it will confirm that, yes, there was over dispersion that needed to handled.

So I played around and came up with two ideas

image

In the top plot, we show the standardized residuals. These should follow a normal distribution, so most points should be within ±2 and almost no points should be outside ±4. There is a pretty big number larger that +2.

The second plot takes the squared residuals and fits a geom_smooth() to them. This gives us the observed average squared residual (i.e., variance) for each Predicted value. It then shows the comparison against the model-predicted variance. In this plot, we see that the observed variance is much higher than expected at a bunch of ranges.

I think the second plot is probably better. It doesn't have individual points, but it is still based on the actual observed residuals, and I think it more clearly indicates the violation of the assumption.

What do you think?

Here is an example of the above data with a model better handling the dispersion (negative binomial, zero-inflated, with a variable dispersion model):

image

bwiernik commented 2 years ago

@bbolker Either of the above require computation of Pearson residuals and/or model-expected conditional variance. Do you think that a function could be added for the conditional variance in glmmTMB for the various count families, including with zero-inflation and variable dispersion (e.g., for nbinom2, V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob))?

strengejacke commented 2 years ago

2nd plot looks great, and is easy to understand even for non-experts. Can you share some code, so I can implement it in performance (and see), or do you want to do that?

bwiernik commented 2 years ago

I'll through some code up

bwiernik commented 2 years ago

docvisit.txt

docvisit <- readr::read_table(file.path("docvisit.txt"))

docvisit

library(glmmTMB)

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)

ep <- modelbased::estimate_expectation(mp) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0,
    Prob = 0,
    V = Predicted,
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mp, type = "pearson")
  )

enb <- modelbased::estimate_expectation(mnb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mnb, type = "disp"),
    Prob = 0,
    V = Predicted * (1 + Predicted / Disp),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mnb, type = "pearson")
  )

ezip <- modelbased::estimate_expectation(mzip) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0
    Prob = predict(mzip, type = "zprob"),
    V = Predicted * (1 - Prob) * (1 + Predicted * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinb <- modelbased::estimate_expectation(mzinb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinb, type = "disp"),
    Prob = predict(mzinb, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinbd <- modelbased::estimate_expectation(mzinbd) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinbd, type = "disp"),
    Prob = predict(mzinbd, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

p1 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_point(aes(y = StdRes)) + 
  geom_hline(
    yintercept = c(-2, 2, -4, 4), 
    linetype = c("solid", "solid", "dashed", "dashed"),
    color = c(rep(see::social_colors("green"), 2), rep(see::social_colors("blue"), 2))
  ) +
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Most points should be within ±2 (green), few points outside ±4 (blue)",
    x = "Predicted mean",
    y = "Standardized resiuduals"
  ) +
  see::theme_modern(plot.title.space = 2)

p2 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_smooth(aes(y = V), color = see::social_colors("blue"), se = FALSE) + 
  geom_smooth(aes(y = Res2), color = see::social_colors("green")) + 
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Observed residual variance (green) should follow predicted residual variance (blue)",
    x = "Predicted mean",
    y = "Residual variance"
  ) +
  see::theme_modern(plot.title.space = 2)

p1 / p2  
strengejacke commented 2 years ago

ok, first model types are supported (missing will be added later). Some tweaking necessary, e.g. correct colors:

library(glmmTMB)
library(performance)
data("Salamanders")
m <- glmmTMB(
  count ~ mined + spp + (1 | site),
  family = poisson,
  data = Salamanders
)
check_model(m)
#> `check_outliers()` does not yet support models of class glmmTMB.

Created on 2022-03-16 by the reprex package (v2.0.1)

bwiernik commented 2 years ago

Awesome. Can you link to where this is implemented?

strengejacke commented 2 years ago
library(readr)
docvisit <- read_table2("C:/Users/DL/Downloads/docvisit.txt")
#> Warning: `read_table2()` was deprecated in readr 2.0.0.
#> Please use `read_table()` instead.
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   sex = col_double(),
#>   age = col_double(),
#>   agesq = col_double(),
#>   income = col_double(),
#>   levyplus = col_double(),
#>   freepoor = col_double(),
#>   freerepa = col_double(),
#>   illness = col_double(),
#>   actdays = col_double(),
#>   hscore = col_double(),
#>   chcond1 = col_double(),
#>   chcond2 = col_double(),
#>   doctorco = col_double(),
#>   nondocco = col_double(),
#>   hospadmi = col_double(),
#>   hospad = col_double(),
#>   medicine = col_double(),
#>   prescrib = col_double(),
#>   nonpresc = col_double()
#> )

library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.1.3
library(performance)

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)
plot(check_overdispersion(mp))
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)
plot(check_overdispersion(mnb))
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)
plot(check_overdispersion(mzip))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#>   for models with zero-inflation or variable dispersion.
#>   Only the visual inspection using 'plot(check_overdispersion(model))' is
#>   possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)
plot(check_overdispersion(mzinb))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#>   for models with zero-inflation or variable dispersion.
#>   Only the visual inspection using 'plot(check_overdispersion(model))' is
#>   possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)
plot(check_overdispersion(mzinbd))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#>   for models with zero-inflation or variable dispersion.
#>   Only the visual inspection using 'plot(check_overdispersion(model))' is
#>   possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Created on 2022-03-16 by the reprex package (v2.0.1)

strengejacke commented 2 years ago

Awesome. Can you link to where this is implemented?

This works for check_model(), not yet for plot() (because it requires the forthcoming update from see on CRAN). I'll commit the latest changes to working PR branches, so you can reproduce the plots above.

strengejacke commented 2 years ago

Ok, you need the latest insight from GitHub/r-universe, performance from this PR (https://github.com/easystats/performance/pull/404), and latest see from GitHub / r-universe. Then, the above examples work. If you install performance from master instead of PR, "only" the check_model() including overdispersion plots work.

strengejacke commented 2 years ago

In future versions, we'll be able to have both plot-types for overdispersion...

library(glmmTMB)
library(performance)
data("Salamanders")
m <- glmmTMB(
  count ~ mined + spp + (1 | site),
  family = poisson,
  data = Salamanders
)
check_model(m)

check_model(m, plot_type = 2)

DominiqueMakowski commented 2 years ago

This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types, I'd just pick one version which we think it's the best and stick with it (until the next improvement)... or, if they show different stuff, have the two overdispersion plots side by side in the corner (subplots)?

strengejacke commented 2 years ago

Yeah, I think for now we stick to plot 1, the default. I just implemented that option, which is undocumented for now, bnut we could make it in a similar way as with normality plots, where we have 3 different types: https://easystats.github.io/see/articles/performance.html#check-for-normal-distributed-residuals