easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Improving check_model() for GLMs #376

Open bwiernik opened 2 years ago

bwiernik commented 2 years ago

The current selection of plots returned by check_model() for GLMs isn't ideal in a few ways.

1. They are missing a linearity check (fitted vs. residuals). For binomial models, this should be a call to binned_residuals(). For other families, the standard check is fine.
2. For binomial models, the constant variance plot should be omitted.
3. For binomial models, the residual QQ plot is hard to interpret.
4. For non-Bernoulli models, we should include a plot for checking overdispersion.

For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches.
https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
https://github.com/florianhartig/DHARMa/issues/33
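To make the binned-residuals idea from point 1 concrete, here is a minimal base-R sketch. It is not the code behind binned_residuals() in performance. The helper name `binned_resid_sketch()`, the square-root rule for the number of bins, and the simulated data are all assumptions made for illustration.

```r
# Base-R sketch of binned residuals for a Bernoulli GLM.
# `binned_resid_sketch()` is a hypothetical helper, not a performance function.
binned_resid_sketch <- function(model, n_bins = NULL) {
  fitted_p <- fitted(model)
  resid_raw <- model$y - fitted_p # response residuals
  n <- length(fitted_p)
  if (is.null(n_bins)) n_bins <- floor(sqrt(n)) # assumed default

  # Bins of roughly equal size, ordered by fitted probability
  bin <- cut(rank(fitted_p, ties.method = "first"),
             breaks = n_bins, labels = FALSE)

  n_bin <- as.numeric(tapply(resid_raw, bin, length))
  data.frame(
    xbar  = as.numeric(tapply(fitted_p, bin, mean)),  # mean fitted value
    ybar  = as.numeric(tapply(resid_raw, bin, mean)), # mean residual
    n     = n_bin,
    # Approximate +/- 2 standard error band for the mean residual
    bound = 2 * as.numeric(tapply(resid_raw, bin, sd)) / sqrt(n_bin)
  )
}

set.seed(123)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.2 * x))
m_bin <- glm(y ~ x, family = binomial())

br <- binned_resid_sketch(m_bin)
# Share of bins whose mean residual lies inside the band
mean(abs(br$ybar) <= br$bound)
```

For a well-specified model, most bin means should fall inside the band. Plotting `ybar` against `xbar` with `bound` as error limits gives the fitted-vs-residuals check described above.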

mattansb commented 2 years ago

This design is somewhat at odds with our traditional opinionated API. Rather than having different plot_types, I'd just pick the one version we think is best and stick with it.

Agreed.

mccarthy-m-g commented 2 years ago

If you're interested in implementing DHARMa's approach, you could do something like this:

library(glmmTMB)
library(performance)

#' Check uniformity of GL(M)M's residuals
#'
#' `check_uniformity()` checks generalized linear (mixed) models for uniformity
#' of randomized quantile residuals, which can be used to identify typical model
#' misspecification problems, such as over/underdispersion, zero-inflation, and
#' residual spatial and temporal autocorrelation.
#'
#' @param object Fitted model.
#'
#' @details
#'
#' See `vignette("DHARMa")`
#'
#' @references
#'
#' - Hartig, F., & Lohse, L. (2022). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models (Version 0.4.5). Retrieved from https://CRAN.R-project.org/package=DHARMa
#' - Dunn, P. K., & Smyth, G. K. (1996). Randomized Quantile Residuals. Journal of Computational and Graphical Statistics, 5(3), 236. https://doi.org/10.2307/1390802
#'
#' @return ggplot.
check_uniformity <- function(object) {

  # Simulated residuals; see vignette("DHARMa")
  simulated_residuals <- DHARMa::simulateResiduals(object)

  dp <- list(min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
  ggplot2::ggplot(
    tibble::tibble(scaled_residuals = residuals(simulated_residuals)),
    ggplot2::aes(sample = scaled_residuals)
  ) +
    qqplotr::stat_qq_band(distribution = "unif", dparams = list(min = 0, max = 1), alpha = .2) +
    qqplotr::stat_qq_line(distribution = "unif", dparams = dp, size = .8, colour = "#3aaf85") +
    qqplotr::stat_qq_point(distribution = "unif", dparams = dp, size = .5, alpha = .8, colour = "#1b6ca8") +
    ggplot2::labs(
      title = "Uniformity of Residuals",
      subtitle = "Dots should fall along the line",
      x = "Standard Uniform Distribution Quantiles",
      y = "Sample Quantiles"
    ) +
    see::theme_lucid()
}

data("Salamanders")
m <- glmmTMB(
  count ~ mined + spp + (1 | site),
  family = poisson,
  data = Salamanders
)

check_uniformity(m)

Created on 2022-06-17 by the reprex package (v2.0.1)
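For readers without DHARMa installed, the idea behind the scaled residuals can be sketched in base R. This is a simplified illustration, not DHARMa's implementation: the helper name `scaled_resid_sketch()`, the choice of 250 simulations, and the simulated Poisson data are assumptions, and it relies on `simulate()` working for the fitted model class (here a plain `glm`).

```r
# Sketch of simulation-based scaled (quantile) residuals using only base R.
# `scaled_resid_sketch()` is a hypothetical helper, not a DHARMa function.
scaled_resid_sketch <- function(model, n_sim = 250) {
  y_obs <- model$y
  # One column per response vector simulated from the fitted model
  sims <- as.matrix(simulate(model, nsim = n_sim))

  # Empirical CDF of the simulations at each observed value, with a uniform
  # jitter across ties so that discrete responses give continuous residuals
  p_below <- rowMeans(sims < y_obs)
  p_equal <- rowMeans(sims == y_obs)
  p_below + runif(length(y_obs)) * p_equal
}

set.seed(42)
x <- rnorm(300)
y <- rpois(300, lambda = exp(0.3 + 0.5 * x))
m_pois <- glm(y ~ x, family = poisson())

u <- scaled_resid_sketch(m_pois)
# Under a correctly specified model, u should look roughly Uniform(0, 1)
ks.test(u, "punif")
```

A QQ plot of `u` against standard uniform quantiles, for example `qqplot(ppoints(length(u)), u)`, is the base-graphics counterpart of the plot produced by check_uniformity() above.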

strengejacke commented 1 year ago

@mccarthy-m-g's suggestion looks rather easy to implement.

mccarthy-m-g commented 1 year ago

@strengejacke What would a PR for this involve? I could get a draft started if this is the solution you want to go for.

bwiernik commented 1 year ago

Let's call the function check_residuals()

@mccarthy-m-g You can add the function to a new checkresiduals.R file here in the performance package. Take a look at one of the other check functions like check_normality.R for an example of the documentation syntax and structure.

Then open a PR here and we can merge it in. After that, we can move over to the see package repo and add the plotting function there.

mccarthy-m-g commented 1 year ago

Hi all, I just opened a new issue to discuss the implementation for check_residuals() (#595). There are a few things that should be resolved before getting a PR started.

strengejacke commented 4 months ago

This is the current development stage. We see a mismatch between the tests based on simulated residuals and the generated plots for the following families/models:

library(performance)
library(glmmTMB)
library(readr)
docvisit <- read_table2("C:/Users/Daniel/Downloads/docvisit.txt")

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mp)
out
#> # Overdispersion test
#> 
#>        dispersion ratio =    1.808
#>   Pearson's Chi-Squared = 9375.539
#>                 p-value =  < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mnb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.005
#>           p-value = 0.816
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mzip)
out
#> # Overdispersion test
#> 
#>  dispersion ratio =   1.417
#>           p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.031
#>           p-value =  0.64
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinbd)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.133
#>           p-value = 0.104
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Created on 2024-03-17 with reprex v2.1.0

Looks like "nbinom2()" is currently inaccurate; the code we use is here: https://github.com/easystats/performance/blob/35b5e19988386b584d91116be542baca1e98f33f/R/check_model_diagnostics.R#L370

(also pinging @bbolker and cross referencing to #654)
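For context on the first output above, a Pearson-based dispersion ratio is the Pearson chi-squared statistic divided by the residual degrees of freedom (9375.539 / 1.808 implies roughly 5186 residual df). A minimal base-R sketch of that calculation follows. The helper name `dispersion_ratio_sketch()` and the simulated data are assumptions. This is not the code used by check_overdispersion(), and the other outputs above report no chi-squared statistic, so they may be computed differently.

```r
# Pearson-based dispersion ratio for a Poisson GLM (base-R sketch).
# `dispersion_ratio_sketch()` is a hypothetical helper.
dispersion_ratio_sketch <- function(model) {
  chisq <- sum(residuals(model, type = "pearson")^2)
  df <- df.residual(model)
  list(
    chisq = chisq,
    ratio = chisq / df,
    # One-sided test: large values indicate overdispersion
    p_value = pchisq(chisq, df = df, lower.tail = FALSE)
  )
}

set.seed(1)
x <- rnorm(1000)
# Negative binomial data fitted with a Poisson model, hence overdispersed
y <- rnbinom(1000, mu = exp(0.5 + 0.4 * x), size = 1)
m_od <- glm(y ~ x, family = poisson())

od <- dispersion_ratio_sketch(m_od)
od
```

A ratio well above 1 with a small p-value points to overdispersion relative to the Poisson assumption, which is what the simulated data are built to show.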