Improving check_model() for GLMs

bwiernik commented 2 years ago

The current selection of plots returned by check_model() for GLMs isn't ideal in a few ways.

~~1. They are missing a linearity check (fitted vs. residuals). For binomial models, this should be a call to binned_residuals(). For other families, the standard check is fine.~~
~~2. For binomial models, the constant variance plot should be omitted.~~
3. For binomial models, the residual QQ plot is hard to interpret.
~~4. For non-Bernoulli models, we should include a plot for checking overdispersion.~~
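To illustrate the binned-residuals idea from point 1, here is a minimal base-R sketch. It is only an approximation of the general technique (average residuals within bins of fitted probability, compared against bounds of about two standard errors) and is not the implementation behind binned_residuals(). The simulated data and all object names are hypothetical.

```r
# Minimal base-R sketch of binned residuals for a logistic regression.
# All object names here (dat, fit, binned) are hypothetical.
set.seed(123)
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

fit <- glm(y ~ x, family = binomial(), data = dat)
fitted_p <- fitted(fit)
raw_res <- dat$y - fitted_p # response-scale residuals

# Assign observations to bins of roughly equal size, ordered by fitted value
n_bins <- floor(sqrt(n))
bin_id <- cut(rank(fitted_p, ties.method = "first"), breaks = n_bins, labels = FALSE)

binned <- data.frame(
  mean_fitted = as.numeric(tapply(fitted_p, bin_id, mean)),
  mean_resid = as.numeric(tapply(raw_res, bin_id, mean)),
  n = as.numeric(tapply(raw_res, bin_id, length)),
  sd_resid = as.numeric(tapply(raw_res, bin_id, sd))
)

# Approximate 95% bounds: +/- 2 standard errors of each bin mean
binned$bound <- 2 * binned$sd_resid / sqrt(binned$n)
binned$inside <- abs(binned$mean_resid) <= binned$bound

# Draw the plot only in interactive sessions
if (interactive()) {
  plot(binned$mean_fitted, binned$mean_resid,
    ylim = range(c(binned$bound, -binned$bound, binned$mean_resid)),
    xlab = "Mean fitted probability", ylab = "Mean residual"
  )
  lines(binned$mean_fitted, binned$bound, lty = 2)
  lines(binned$mean_fitted, -binned$bound, lty = 2)
  abline(h = 0)
}

mean(binned$inside) # share of bins whose mean residual lies inside the bounds
```

For a well-specified model, most bin means should fall inside the dashed bounds. Bins at extreme fitted probabilities can be flagged spuriously because the residuals there have very little spread.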

For the latter few points, the DHARMa package provides an easy-to-interpret approach, based on quantile residuals, for checking distributional assumptions from QQ plots and for spotting problems in fitted vs. residual plots. We might consider soft-importing DHARMa or re-implementing those approaches.

https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
https://github.com/florianhartig/DHARMa/issues/33
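As a rough illustration of what quantile residuals are, the sketch below computes randomized quantile residuals for a Poisson GLM in base R. It uses the analytic Poisson CDF; as far as I understand, DHARMa gets a comparable quantity by simulating from the fitted model instead, so this is not DHARMa's code. The simulated data and object names are hypothetical.

```r
# Randomized quantile residuals for a Poisson GLM, base R only.
# Object names (fit, mu, u) are hypothetical.
set.seed(42)
n <- 400
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.3 + 0.5 * x))

fit <- glm(y ~ x, family = poisson())
mu <- fitted(fit)

# For a discrete response, draw u uniformly between F(y - 1) and F(y),
# where F is the fitted Poisson CDF; ppois() returns 0 for negative quantiles.
lower <- ppois(y - 1, lambda = mu)
upper <- ppois(y, lambda = mu)
u <- runif(n, min = lower, max = upper)

# Under a correctly specified model, u should look standard uniform
ks <- ks.test(u, "punif")
ks$p.value

# Uniform QQ plot, drawn only in interactive sessions
if (interactive()) {
  plot(ppoints(n), sort(u),
    xlab = "Standard Uniform Distribution Quantiles",
    ylab = "Sample Quantiles"
  )
  abline(0, 1)
}
```

Because the residuals are uniform whatever the family, the same QQ plot reads the same way for binomial, Poisson, and other models, which is the main appeal over the normal QQ plot of deviance residuals.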

mattansb commented 2 years ago

This design is somewhat at odds with our traditional opinionated API. Rather than having different plot_types, I'd just pick the one version we think is best and stick with it.

Agreed.

mccarthy-m-g commented 2 years ago

If you're interested in implementing DHARMa's approach, you could do something like this:

```r
library(glmmTMB)
library(performance)

#' Check uniformity of GL(M)M's residuals
#'
#' `check_uniformity()` checks generalized linear (mixed) models for uniformity
#' of randomized quantile residuals, which can be used to identify typical model
#' misspecification problems, such as over/underdispersion, zero-inflation, and
#' residual spatial and temporal autocorrelation.
#'
#' @param object Fitted model.
#'
#' @details
#'
#' See `vignette("DHARMa")`
#'
#' @references
#'
#' - Hartig, F., & Lohse, L. (2022). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models (Version 0.4.5). Retrieved from https://CRAN.R-project.org/package=DHARMa
#' - Dunn, P. K., & Smyth, G. K. (1996). Randomized Quantile Residuals. Journal of Computational and Graphical Statistics, 5(3), 236. https://doi.org/10.2307/1390802
#'
#' @return ggplot.
check_uniformity <- function(object) {

  # Simulated residuals; see vignette("DHARMa")
  simulated_residuals <- DHARMa::simulateResiduals(object)

  dp <- list(min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
  ggplot2::ggplot(
    tibble::tibble(scaled_residuals = residuals(simulated_residuals)),
    ggplot2::aes(sample = scaled_residuals)
  ) +
    qqplotr::stat_qq_band(distribution = "unif", dparams = list(min = 0, max = 1), alpha = .2) +
    qqplotr::stat_qq_line(distribution = "unif", dparams = dp, size = .8, colour = "#3aaf85") +
    qqplotr::stat_qq_point(distribution = "unif", dparams = dp, size = .5, alpha = .8, colour = "#1b6ca8") +
    ggplot2::labs(
      title = "Uniformity of Residuals",
      subtitle = "Dots should fall along the line",
      x = "Standard Uniform Distribution Quantiles",
      y = "Sample Quantiles"
    ) +
    see::theme_lucid()
}

data("Salamanders")
m <- glmmTMB(
  count ~ mined + spp + (1 | site),
  family = poisson,
  data = Salamanders
)

check_uniformity(m)
```

Created on 2022-06-17 by the reprex package (v2.0.1)

strengejacke commented 1 year ago

@mccarthy-m-g's suggestion looks rather easy to implement.

mccarthy-m-g commented 1 year ago

@strengejacke What would a PR for this involve? I could get a draft started if this is the solution you want to go for.

bwiernik commented 1 year ago

Let's call the function check_residuals()

@mccarthy-m-g You can add the function to a new checkresiduals.R file here in the performance package. Take a look at one of the other check functions like check_normality.R for an example of the documentation syntax and structure.

Then open a PR here and we can merge it in. After that, we can move over to the see package repo and add the plotting function there.

mccarthy-m-g commented 1 year ago

Hi all, I just opened a new issue to discuss the implementation for check_residuals() (#595). There are a few things that should be resolved before getting a PR started.

strengejacke commented 4 months ago

This is the current development stage. We see a mismatch between the tests based on simulated residuals and the generated plots for the following families/models:

- mnb / nbinom2() (plot suggests underdispersion)
- mzinb / nbinom2() with ZI (plot suggests underdispersion)

```r
library(performance)
library(glmmTMB)
library(readr)
docvisit <- read_table2("C:/Users/Daniel/Downloads/docvisit.txt")

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mp)
out
#> # Overdispersion test
#> 
#>        dispersion ratio =    1.808
#>   Pearson's Chi-Squared = 9375.539
#>                 p-value =  < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mnb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.005
#>           p-value = 0.816
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mzip)
out
#> # Overdispersion test
#> 
#>  dispersion ratio =   1.417
#>           p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.031
#>           p-value =  0.64
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinbd)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.133
#>           p-value = 0.104
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
```

Created on 2024-03-17 with reprex v2.1.0

It looks like the handling of nbinom2() is currently inaccurate. The code we use is here: https://github.com/easystats/performance/blob/35b5e19988386b584d91116be542baca1e98f33f/R/check_model_diagnostics.R#L370

(also pinging @bbolker and cross referencing to #654)
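For context, the dispersion ratio reported for the Poisson model above looks consistent with the usual Pearson statistic divided by the residual degrees of freedom (9375.539 / 1.808 gives roughly 5,185). Below is a minimal base-R sketch of that calculation on simulated data. It assumes a plain Poisson glm() and uses hypothetical object names. It is not the code linked above and does not cover nbinom2() or zero-inflated models, where the variance function differs.

```r
# Pearson-based dispersion ratio for a Poisson GLM, base R only.
# Simulated data and object names are hypothetical.
set.seed(1)
n <- 1000
x <- rnorm(n)
mu <- exp(0.2 + 0.4 * x)

# Negative binomial counts, i.e. overdispersed relative to Poisson
y <- rnbinom(n, mu = mu, size = 1.5)

fit <- glm(y ~ x, family = poisson())

# Sum of squared Pearson residuals, compared to the residual df
pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
res_df <- df.residual(fit)
dispersion_ratio <- pearson_chisq / res_df

# One-sided chi-squared test for overdispersion
p_value <- pchisq(pearson_chisq, df = res_df, lower.tail = FALSE)

c(dispersion_ratio = dispersion_ratio, p_value = p_value)
```

A ratio well above 1 with a small p-value points to overdispersion. For a negative binomial fit, the Pearson residuals would need to be scaled by that family's variance, so this Poisson-only sketch cannot be applied to it unchanged.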

easystats / performance

Improving check_model() for GLMs #376