gavinsimpson / gratia

ggplot-based graphics and useful functions for GAMs fitted using the mgcv package
https://gavinsimpson.github.io/gratia/
Other
206 stars 28 forks source link

A better model summary #214

Open noamross opened 1 year ago

noamross commented 1 year ago

I think gratia would be a nice home for an alternative GAM model summary table that would be a reasonable first pass for including in a paper or at least a supplement for common models, and be amenable to modification. Things that I think would be good to have:

I'm doing this for myself, and would be happy to start a PR based on it.

gavinsimpson commented 1 year ago

There's currently an overview() that does some of this:

library("gratia")
library("mgcv")

eg1 <- data_sim("eg1", n = 1000,  dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), 
         data = eg1, 
         method = "REML")

overview(m)
> overview(m)

Generalized Additive Model with 4 terms

  term  type    edf statistic p.value
  <chr> <chr> <dbl>     <dbl> <chr>  
1 s(x0) TPRS   4.23    16.1   < 0.001
2 s(x1) TPRS   3.50   169.    < 0.001
3 s(x2) TPRS   8.61   201.    < 0.001
4 s(x3) TPRS   1.00     0.599 0.43875

It currently doesn't return estimates for parametric terms:

eg4 <- data_sim("eg4", n = 400,  dist = "normal", scale = 2, seed = 1)
m_by <- gam(y ~ fac + s(x2, by = fac) + s(x0), 
            data = su_eg4, method = "REML")
overview(m_by)
> overview(m_by)

Generalized Additive Model with 5 terms

  term       type         edf statistic p.value  
  <chr>      <chr>      <dbl>     <dbl> <chr>    
1 fac        parametric  2     150.     < 0.001  
2 s(x2):fac1 TPRS        2.85    4.62   0.0021453
3 s(x2):fac2 TPRS        2.95   30.9    < 0.001  
4 s(x2):fac3 TPRS        6.49   30.9    < 0.001  
5 s(x0)      TPRS        1.00    0.0168 0.8996412

It wouldn't be hard to have "parametric" be "factor" or "continuous" say, but I hadn't thought about returning the actual estimates (and the constant term is missing here).

What do you mean by Knots? The number of basis functions used? That would be easy to add.

I'm not a big fan of stars on p values, so I'd resist including them.

Expanding on the basis names (type) could be done - I was trying to avoid having the output be too wide, but my short-names are not very consistent (some aren't very short for some bases).

The above output is a plain tibble with some custom formatting rather than a specific print method: https://github.com/gavinsimpson/gratia/blob/HEAD/R/overview.R

Is this (sort of) what you had in mind?

noamross commented 1 year ago

I did not know about overview()! Yes, this is a way towards what I'm looking for. I do think anything that makes it easier for people to see and report effect sizes is important - they're under-reported even in linear models and really rare/hard for people to extract with GAMs. Yes, by Knots I mean k - I think that it's useful to have those alongside edf for interpretation. I'm not big on stars either. In theory I hoped that printing them would encourage people to use the output, but if they want them I suppose they can add it themselves.

gavinsimpson commented 1 year ago

I've added a few bits of this to overview(). As of 0.8.1.13 the output is:

> overview(m_by, stars = TRUE)                                                           

Generalized Additive Model with 5 terms

  term       type           k   edf statistic p.value   stars    
  <chr>      <chr>      <dbl> <dbl>     <dbl> <chr>     <noquote>
1 fac        parametric    NA  2     150.     < 0.001   ***      
2 s(x2):fac1 TPRS           9  2.85    4.62   0.0021453 **       
3 s(x2):fac2 TPRS           9  2.95   30.9    < 0.001   ***      
4 s(x2):fac3 TPRS           9  6.49   30.9    < 0.001   ***      
5 s(x0)      TPRS           9  1.00    0.0168 0.8996412

but stars is FALSE by default.

I was thinking about effect sizes, and while it's not too onerous to do this for univariate smooths (evaluating the smooth at 100 evenly spaced values should be fine for most cases), it gets slower when we're doing this for multidimensional smooths unless we keep the number of evaluation points low-ish. I haven't looked into where things are slowing down and an obvious place to speed things up would be to run each smooth in a separate thread/process with furrr and future, but I haven't implemented any of the effect-size bits yet because I want to think a little more about them.

noamross commented 1 year ago

For the 2 papers I'm doing this for right now, we decided to have min/max in the summary calculated only from the fitted values on the data the model was trained on, so there's no additional predictions to make and these min/max values don't interpolate or extrapolate at all. This is at least predictable. For 2D+ one needs to think about things like too.far anyway and I think that's too case-base-case to get into. I think cases where the 1D min/max is well outside the fitted values are edge cases that would need to be diagnosed by looking at partial effects anyway.

asgersvenning commented 1 month ago

Hi, I've been using the package for a bit now and I really enjoy it. I was also trying to figure out how to extract effect sizes, particularly for the fixed/parametric effects, so I thought this would be a good place to discuss this.

I figured that overview was what I was looking for, but given that it currently uses pTerms.table from the gam summary object: https://github.com/gavinsimpson/gratia/blob/0d0016932bda137666c5dc05332f7ba8539b9d0f/R/overview.R#L76 which aggregates factor effects and also doesn't contain the effect sizes I tried modifying overview to instead use p.table:

mod_overview <- function(
    model, 
    parametric = TRUE, 
    parametric_effect_sizes = FALSE,
    random_effects = TRUE,
    dispersion = NULL,
    frequentist = FALSE,
    accuracy = 0.001,
    stars = FALSE,
    ...
) {
  smry <- model %>% 
    summary(
      dispersion = dispersion,
      re.test = random_effects,
      freq = frequentist
    )

  nms <- c("term", "type", "estimate", "se", "k", "edf", "statistic", "p.value")

  # smooth terms
  types <- vapply(model$smooth, smooth_type, character(1))
  dfs <- vapply(model$smooth, basis_size, double(1))

  out <- smry$s.table %>% 
    as.data.frame() %>%
    rownames_to_column() %>%
    as_tibble()

  if (nrow(out) > 0) {
    out <- out %>%
      select(!matches("Ref.df")) %>%
      rename(
        statistic = 3
      ) %>% 
      add_column(
        type = types,
        Estimate = NA_real_,
        `Std. Error` = NA_real_,
        k = dfs, 
        .after = 1L
      )
  }

  # parametric terms
  para <- NULL
  if (isTRUE(parametric) && !is.null(smry$p.table)) {
    if (isFALSE(parametric_effect_sizes)) {
      para <- as.data.frame(smry$pTerms.table) %>%
        rownames_to_column() %>%
        as_tibble()
      if (nrow(para) > 0) {
        para <- para %>% 
          rename(
            edf = df,
            statistic = 3,
          ) %>% 
          add_column(
            type = "parametric",
            Estimate = NA_real_,
            `Std. Error` = NA_real_,
            k = NA_real_,
            .after = 1L
          )
      }
    } else {
      para <- smry$p.table %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        as_tibble()

      if (nrow(para) > 0) {
        para <- para %>% 
          rename(
            statistic = 4,
            `p-value` = 5
          ) %>% 
          add_column(
            type = "parametric", 
            .after = 1L
          ) %>% 
          add_column(
            k = NA_real_,
            edf = 1,
            .after = 4L
          )
      }
    }

    out <- bind_rows(para, out)
  }

  out <- set_names(out, nms)

  if (isFALSE(parametric_effect_sizes)) out <- out %>% 
    select(!c(estimate, se))

  if (stars) {
    sstars <- out$p.value %>% 
      symnum(
        corr = FALSE,
        na = FALSE,
        cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
        symbols = c("***", "**", "*", ".", " ")
      )
    out <- out %>% 
      mutate(
        # p = .data$p.value,
        p.value = format.pval(.data$p.value, eps = accuracy),
        stars = sstars
      ) # not sure why as.character(sstars) is wrong here "***"
    attr(out, "legend") <- attr(sstars, "legend")
  } else {
    out <- out %>% 
      mutate(
        p.value = format.pval(.data$p.value, eps = accuracy)
      )
  }

  class(out) <- append(class(out), values = "overview", after = 0)
  out
}

The goal was to make the output of overview as similar to mgcv:::print.summary.gam as possible. (implemented it to default to the old behaviour for backward compatibility)

Example

Consider a model:

mod <- gam(Petal.Width ~ Sepal.Length*Species + s(Petal.Length), data = iris)

Then the summary output is:

summary(mod)

Family: gaussian 
Link function: identity 

Formula:
Petal.Width ~ Sepal.Length * Species + s(Petal.Length)

Parametric coefficients:
                               Estimate Std. Error t value Pr(>|t|)  
(Intercept)                     0.58513    0.43255   1.353   0.1783  
Sepal.Length                    0.05635    0.07356   0.766   0.4449  
Speciesversicolor               0.72497    0.54825   1.322   0.1882  
Speciesvirginica                1.08021    0.62534   1.727   0.0863 .
Sepal.Length:Speciesversicolor -0.07538    0.09938  -0.758   0.4494  
Sepal.Length:Speciesvirginica  -0.07635    0.10143  -0.753   0.4529  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                  edf Ref.df     F  p-value    
s(Petal.Length) 3.044  3.983 6.204 0.000125 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.946   Deviance explained = 94.9%
GCV = 0.033147  Scale est. = 0.031149  n = 150

But the output of overview is:

overview(mod)

Generalized Additive Model with 4 terms

  term                 type           k   edf statistic p.value
  <chr>                <chr>      <dbl> <dbl>     <dbl> <chr>  
1 Sepal.Length         parametric    NA  1        0.587 0.44489
2 Species              parametric    NA  2        1.58  0.20905
3 Sepal.Length:Species parametric    NA  2        0.356 0.70108
4 s(Petal.Length)      TPRS           9  3.04     6.20  < 0.001

Whereas the output of mod_overview is:

mod_overview(mod, parametric_effect_sizes = TRUE)

Generalized Additive Model with 7 terms

  term                           type       estimate      se     k   edf statistic p.value 
  <chr>                          <chr>         <dbl>   <dbl> <dbl> <dbl>     <dbl> <chr>   
1 (Intercept)                    parametric   0.585   0.433     NA  1        1.35  0.178296
2 Sepal.Length                   parametric   0.0564  0.0736    NA  1        0.766 0.444891
3 Speciesversicolor              parametric   0.725   0.548     NA  1        1.32  0.188196
4 Speciesvirginica               parametric   1.08    0.625     NA  1        1.73  0.086286
5 Sepal.Length:Speciesversicolor parametric  -0.0754  0.0994    NA  1       -0.758 0.449425
6 Sepal.Length:Speciesvirginica  parametric  -0.0764  0.101     NA  1       -0.753 0.452856
7 s(Petal.Length)                TPRS        NA      NA          9  3.04     6.20  < 0.001

Is this something that might fit gratia? If it is I can make a PR :)

EDIT: type error.

asgersvenning commented 1 month ago

Reprex

library(tidyverse)
library(mgcv)
#> Indlæser krævet pakke: nlme
#> 
#> Vedhæfter pakke: 'nlme'
#> Det følgende objekt er maskeret fra 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(gratia)
#> 
#> Vedhæfter pakke: 'gratia'
#> Det følgende objekt er maskeret fra 'package:stringr':
#> 
#>     boundary

mod_overview <- function(
    model, 
    parametric = TRUE, 
    parametric_effect_sizes = FALSE,
    random_effects = TRUE,
    dispersion = NULL,
    frequentist = FALSE,
    accuracy = 0.001,
    stars = FALSE,
    ...
) {
  smry <- model %>% 
    summary(
      dispersion = dispersion,
      re.test = random_effects,
      freq = frequentist
    )

  nms <- c("term", "type", "estimate", "se", "k", "edf", "statistic", "p.value")

  # smooth terms
  types <- vapply(model$smooth, smooth_type, character(1))
  dfs <- vapply(model$smooth, basis_size, double(1))

  out <- smry$s.table %>% 
    as.data.frame() %>%
    rownames_to_column() %>%
    as_tibble()

  if (nrow(out) > 0) {
    out <- out %>%
      select(!matches("Ref.df")) %>%
      rename(
        statistic = 3
      ) %>% 
      add_column(
        type = types,
        Estimate = NA_real_,
        `Std. Error` = NA_real_,
        k = dfs, 
        .after = 1L
      )
  }

  # parametric terms
  para <- NULL
  if (isTRUE(parametric) && !is.null(smry$p.table)) {
    if (isFALSE(parametric_effect_sizes)) {
      para <- as.data.frame(smry$pTerms.table) %>%
        rownames_to_column() %>%
        as_tibble()
      if (nrow(para) > 0) {
        para <- para %>% 
          rename(
            edf = df,
            statistic = 3,
          ) %>% 
          add_column(
            type = "parametric",
            Estimate = NA_real_,
            `Std. Error` = NA_real_,
            k = NA_real_,
            .after = 1L
          )
      }
    } else {
      para <- smry$p.table %>% 
        as.data.frame() %>%
        rownames_to_column() %>%
        as_tibble()

      if (nrow(para) > 0) {
        para <- para %>% 
          rename(
            statistic = 4,
            `p-value` = 5
          ) %>% 
          add_column(
            type = "parametric", 
            .after = 1L
          ) %>% 
          add_column(
            k = NA_real_,
            edf = 1,
            .after = 4L
          )
      }
    }

    out <- bind_rows(para, out)
  }

  out <- set_names(out, nms)

  if (isFALSE(parametric_effect_sizes)) out <- out %>% 
    select(!c(estimate, se))

  if (stars) {
    sstars <- out$p.value %>% 
      symnum(
        corr = FALSE,
        na = FALSE,
        cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
        symbols = c("***", "**", "*", ".", " ")
      )
    out <- out %>% 
      mutate(
        # p = .data$p.value,
        p.value = format.pval(.data$p.value, eps = accuracy),
        stars = sstars
      ) # not sure why as.character(sstars) is wrong here "***"
    attr(out, "legend") <- attr(sstars, "legend")
  } else {
    out <- out %>% 
      mutate(
        p.value = format.pval(.data$p.value, eps = accuracy)
      )
  }

  class(out) <- append(class(out), values = "overview", after = 0)
  out
}

mod <- gam(Petal.Width ~ Sepal.Width*Species + s(Petal.Length), data = iris)

summary(mod)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> Petal.Width ~ Sepal.Width * Species + s(Petal.Length)
#> 
#> Parametric coefficients:
#>                               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                    0.52470    0.26301   1.995 0.047966 *  
#> Sepal.Width                    0.04917    0.06099   0.806 0.421477    
#> Speciesversicolor              0.02786    0.34396   0.081 0.935569    
#> Speciesvirginica               0.02175    0.37911   0.057 0.954330    
#> Sepal.Width:Speciesversicolor  0.18676    0.10398   1.796 0.074605 .  
#> Sepal.Width:Speciesvirginica   0.33826    0.09789   3.455 0.000726 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                   edf Ref.df     F  p-value    
#> s(Petal.Length) 2.778  3.588 5.583 0.000672 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.956   Deviance explained = 95.8%
#> GCV = 0.027261  Scale est. = 0.025666  n = 150
overview(mod)
#> 
#> Generalized Additive Model with 4 terms
#> 
#>   term                type           k   edf statistic p.value 
#>   <chr>               <chr>      <dbl> <dbl>     <dbl> <chr>   
#> 1 Sepal.Width         parametric    NA  1      0.650   0.421477
#> 2 Species             parametric    NA  2      0.00335 0.996658
#> 3 Sepal.Width:Species parametric    NA  2      6.09    0.002894
#> 4 s(Petal.Length)     TPRS           9  2.78   5.58    < 0.001
mod_overview(mod)
#> 
#> Generalized Additive Model with 4 terms
#> 
#>   term                type           k   edf statistic p.value 
#>   <chr>               <chr>      <dbl> <dbl>     <dbl> <chr>   
#> 1 Sepal.Width         parametric    NA  1      0.650   0.421477
#> 2 Species             parametric    NA  2      0.00335 0.996658
#> 3 Sepal.Width:Species parametric    NA  2      6.09    0.002894
#> 4 s(Petal.Length)     TPRS           9  2.78   5.58    < 0.001
mod_overview(mod, parametric_effect_sizes = T, stars=TRUE)
#> 
#> Generalized Additive Model with 7 terms
#> 
#>   term                type  estimate      se     k   edf statistic p.value stars
#>   <chr>               <chr>    <dbl>   <dbl> <dbl> <dbl>     <dbl> <chr>   <noq>
#> 1 (Intercept)         para…   0.525   0.263     NA  1       2.00   0.0479… *    
#> 2 Sepal.Width         para…   0.0492  0.0610    NA  1       0.806  0.4214…      
#> 3 Speciesversicolor   para…   0.0279  0.344     NA  1       0.0810 0.9355…      
#> 4 Speciesvirginica    para…   0.0218  0.379     NA  1       0.0574 0.9543…      
#> 5 Sepal.Width:Specie… para…   0.187   0.104     NA  1       1.80   0.0746… .    
#> 6 Sepal.Width:Specie… para…   0.338   0.0979    NA  1       3.46   < 0.001 ***  
#> 7 s(Petal.Length)     TPRS   NA      NA          9  2.78    5.58   < 0.001 ***  
#> 
#> # 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Created on 2024-10-22 with reprex v2.1.0

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.4.1 (2024-06-14 ucrt) #> os Windows 11 x64 (build 22631) #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate Danish_Denmark.utf8 #> ctype Danish_Denmark.utf8 #> tz Europe/Copenhagen #> date 2024-10-22 #> pandoc 3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.3 2024-06-21 [2] CRAN (R 4.4.1) #> colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.1) #> digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1) #> dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.4.1) #> evaluate 1.0.0 2024-09-17 [1] CRAN (R 4.4.1) #> fansi 1.0.6 2023-12-08 [2] CRAN (R 4.4.1) #> farver 2.1.2 2024-05-13 [2] CRAN (R 4.4.1) #> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.4.1) #> forcats * 1.0.0 2023-01-29 [2] CRAN (R 4.4.1) #> fs 1.6.4 2024-04-25 [2] CRAN (R 4.4.1) #> generics 0.1.3 2022-07-05 [2] CRAN (R 4.4.1) #> ggokabeito 0.1.0 2021-10-18 [1] CRAN (R 4.4.1) #> ggplot2 * 3.5.1 2024-04-23 [2] CRAN (R 4.4.1) #> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1) #> gratia * 0.9.2.9010 2024-10-20 [1] Github (gavinsimpson/gratia@0d00169) #> gtable 0.3.5 2024-04-22 [2] CRAN (R 4.4.1) #> hms 1.1.3 2023-03-21 [2] CRAN (R 4.4.1) #> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.1) #> knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1) #> lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.1) #> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.4.1) #> lubridate * 1.9.3 2023-09-27 [2] CRAN (R 4.4.1) #> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.4.1) #> Matrix 1.7-0 2024-04-26 [2] CRAN (R 4.4.1) #> mgcv * 1.9-1 2023-12-21 [2] CRAN (R 4.4.1) #> munsell 0.5.1 2024-04-01 [2] CRAN (R 4.4.1) #> mvnfast 0.2.8 2023-02-23 [1] CRAN (R 4.4.1) #> nlme * 3.1-164 2023-11-27 [2] CRAN (R 4.4.1) #> patchwork 1.3.0 2024-09-16 [1] CRAN (R 4.4.1) #> pillar 1.9.0 2023-03-22 [2] CRAN (R 4.4.1) #> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.4.1) #> purrr * 1.0.2 2023-08-10 [2] CRAN (R 4.4.1) #> R6 2.5.1 2021-08-19 [2] CRAN (R 4.4.1) #> Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1) #> readr * 2.1.5 2024-01-10 [2] CRAN (R 4.4.1) #> reprex 2.1.0 2024-01-11 [2] CRAN (R 4.4.1) #> rlang 1.1.4 2024-06-04 [2] CRAN (R 4.4.1) #> rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.1) #> rstudioapi 0.16.0 2024-03-24 [2] CRAN (R 4.4.1) #> scales 1.3.0 2023-11-28 [2] CRAN (R 4.4.1) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.1) #> stringi 1.8.4 2024-05-06 [2] CRAN (R 4.4.0) #> stringr * 1.5.1 2023-11-14 [2] CRAN (R 4.4.1) #> tibble * 3.2.1 2023-03-20 [2] CRAN (R 4.4.1) #> tidyr * 1.3.1 2024-01-24 [2] CRAN (R 4.4.1) #> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.4.1) #> tidyverse * 2.0.0 2023-02-22 [2] CRAN (R 4.4.1) #> timechange 0.3.0 2024-01-18 [2] CRAN (R 4.4.1) #> tzdb 0.4.0 2023-05-12 [2] CRAN (R 4.4.1) #> utf8 1.2.4 2023-10-22 [2] CRAN (R 4.4.1) #> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.4.1) #> withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.1) #> xfun 0.48 2024-10-03 [1] CRAN (R 4.4.1) #> yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1) #> #> [1] C:/Users/asger/AppData/Local/R/win-library/4.4 #> [2] C:/Program Files/R/R-4.4.1/library #> #> ────────────────────────────────────────────────────────────────────────────── ```