Open noamross opened 1 year ago
There's currently an overview()
that does some of this:
library("gratia")
library("mgcv")
eg1 <- data_sim("eg1", n = 1000, dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
data = eg1,
method = "REML")
overview(m)
> overview(m)
Generalized Additive Model with 4 terms
term type edf statistic p.value
<chr> <chr> <dbl> <dbl> <chr>
1 s(x0) TPRS 4.23 16.1 < 0.001
2 s(x1) TPRS 3.50 169. < 0.001
3 s(x2) TPRS 8.61 201. < 0.001
4 s(x3) TPRS 1.00 0.599 0.43875
It currently doesn't return estimates for parametric terms:
eg4 <- data_sim("eg4", n = 400, dist = "normal", scale = 2, seed = 1)
m_by <- gam(y ~ fac + s(x2, by = fac) + s(x0),
data = su_eg4, method = "REML")
overview(m_by)
> overview(m_by)
Generalized Additive Model with 5 terms
term type edf statistic p.value
<chr> <chr> <dbl> <dbl> <chr>
1 fac parametric 2 150. < 0.001
2 s(x2):fac1 TPRS 2.85 4.62 0.0021453
3 s(x2):fac2 TPRS 2.95 30.9 < 0.001
4 s(x2):fac3 TPRS 6.49 30.9 < 0.001
5 s(x0) TPRS 1.00 0.0168 0.8996412
It wouldn't be hard to have "parametric"
be "factor"
or "continuous"
say, but I hadn't thought about returning the actual estimates (and the constant term is missing here).
What do you mean by Knots
? The number of basis functions used? That would be easy to add.
I'm not a big fan of stars on p values, so I'd resist including them.
Expanding on the basis names (type
) could be done - I was trying to avoid having the output be too wide, but my short-names are not very consistent (some aren't very short for some bases).
The above output is a plain tibble with some custom formatting rather than a specific print
method: https://github.com/gavinsimpson/gratia/blob/HEAD/R/overview.R
Is this (sort of) what you had in mind?
I did not know about overview()
! Yes, this is a way towards what I'm looking for. I do think anything that makes it easier for people to see and report effect sizes is important - they're under-reported even in linear models and really rare/hard for people to extract with GAMs. Yes, by Knots I mean k
- I think that it's useful to have those alongside edf
for interpretation. I'm not big on stars either. In theory I hoped that printing them would encourage people to use the output, but if they want them I suppose they can add it themselves.
I've added a few bits of this to overview()
. As of 0.8.1.13 the output is:
> overview(m_by, stars = TRUE)
Generalized Additive Model with 5 terms
term type k edf statistic p.value stars
<chr> <chr> <dbl> <dbl> <dbl> <chr> <noquote>
1 fac parametric NA 2 150. < 0.001 ***
2 s(x2):fac1 TPRS 9 2.85 4.62 0.0021453 **
3 s(x2):fac2 TPRS 9 2.95 30.9 < 0.001 ***
4 s(x2):fac3 TPRS 9 6.49 30.9 < 0.001 ***
5 s(x0) TPRS 9 1.00 0.0168 0.8996412
but stars
is FALSE
by default.
I was thinking about effect sizes, and while it's not too onerous to do this for univariate smooths (evaluating the smooth at 100 evenly spaced values should be fine for most cases), it gets slower when we're doing this for multidimensional smooths unless we keep the number of evaluation points low-ish. I haven't looked into where things are slowing down and an obvious place to speed things up would be to run each smooth in a separate thread/process with furrr and future, but I haven't implemented any of the effect-size bits yet because I want to think a little more about them.
For the 2 papers I'm doing this for right now, we decided to have min/max in the summary calculated only from the fitted values on the data the model was trained on, so there's no additional predictions to make and these min/max values don't interpolate or extrapolate at all. This is at least predictable. For 2D+ one needs to think about things like too.far
anyway and I think that's too case-base-case to get into. I think cases where the 1D min/max is well outside the fitted values are edge cases that would need to be diagnosed by looking at partial effects anyway.
Hi, I've been using the package for a bit now and I really enjoy it. I was also trying to figure out how to extract effect sizes, particularly for the fixed/parametric effects, so I thought this would be a good place to discuss this.
I figured that overview
was what I was looking for, but given that it currently uses pTerms.table
from the gam
summary object: https://github.com/gavinsimpson/gratia/blob/0d0016932bda137666c5dc05332f7ba8539b9d0f/R/overview.R#L76 which aggregates factor effects and also doesn't contain the effect sizes I tried modifying overview
to instead use p.table
:
mod_overview <- function(
model,
parametric = TRUE,
parametric_effect_sizes = FALSE,
random_effects = TRUE,
dispersion = NULL,
frequentist = FALSE,
accuracy = 0.001,
stars = FALSE,
...
) {
smry <- model %>%
summary(
dispersion = dispersion,
re.test = random_effects,
freq = frequentist
)
nms <- c("term", "type", "estimate", "se", "k", "edf", "statistic", "p.value")
# smooth terms
types <- vapply(model$smooth, smooth_type, character(1))
dfs <- vapply(model$smooth, basis_size, double(1))
out <- smry$s.table %>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()
if (nrow(out) > 0) {
out <- out %>%
select(!matches("Ref.df")) %>%
rename(
statistic = 3
) %>%
add_column(
type = types,
Estimate = NA_real_,
`Std. Error` = NA_real_,
k = dfs,
.after = 1L
)
}
# parametric terms
para <- NULL
if (isTRUE(parametric) && !is.null(smry$p.table)) {
if (isFALSE(parametric_effect_sizes)) {
para <- as.data.frame(smry$pTerms.table) %>%
rownames_to_column() %>%
as_tibble()
if (nrow(para) > 0) {
para <- para %>%
rename(
edf = df,
statistic = 3,
) %>%
add_column(
type = "parametric",
Estimate = NA_real_,
`Std. Error` = NA_real_,
k = NA_real_,
.after = 1L
)
}
} else {
para <- smry$p.table %>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()
if (nrow(para) > 0) {
para <- para %>%
rename(
statistic = 4,
`p-value` = 5
) %>%
add_column(
type = "parametric",
.after = 1L
) %>%
add_column(
k = NA_real_,
edf = 1,
.after = 4L
)
}
}
out <- bind_rows(para, out)
}
out <- set_names(out, nms)
if (isFALSE(parametric_effect_sizes)) out <- out %>%
select(!c(estimate, se))
if (stars) {
sstars <- out$p.value %>%
symnum(
corr = FALSE,
na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " ")
)
out <- out %>%
mutate(
# p = .data$p.value,
p.value = format.pval(.data$p.value, eps = accuracy),
stars = sstars
) # not sure why as.character(sstars) is wrong here "***"
attr(out, "legend") <- attr(sstars, "legend")
} else {
out <- out %>%
mutate(
p.value = format.pval(.data$p.value, eps = accuracy)
)
}
class(out) <- append(class(out), values = "overview", after = 0)
out
}
The goal was to make the output of overview
as similar to mgcv:::print.summary.gam
as possible. (implemented it to default to the old behaviour for backward compatibility)
Consider a model:
mod <- gam(Petal.Width ~ Sepal.Length*Species + s(Petal.Length), data = iris)
Then the summary output is:
summary(mod)
Family: gaussian
Link function: identity
Formula:
Petal.Width ~ Sepal.Length * Species + s(Petal.Length)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.58513 0.43255 1.353 0.1783
Sepal.Length 0.05635 0.07356 0.766 0.4449
Speciesversicolor 0.72497 0.54825 1.322 0.1882
Speciesvirginica 1.08021 0.62534 1.727 0.0863 .
Sepal.Length:Speciesversicolor -0.07538 0.09938 -0.758 0.4494
Sepal.Length:Speciesvirginica -0.07635 0.10143 -0.753 0.4529
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Petal.Length) 3.044 3.983 6.204 0.000125 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.946 Deviance explained = 94.9%
GCV = 0.033147 Scale est. = 0.031149 n = 150
But the output of overview
is:
overview(mod)
Generalized Additive Model with 4 terms
term type k edf statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length parametric NA 1 0.587 0.44489
2 Species parametric NA 2 1.58 0.20905
3 Sepal.Length:Species parametric NA 2 0.356 0.70108
4 s(Petal.Length) TPRS 9 3.04 6.20 < 0.001
Whereas the output of mod_overview
is:
mod_overview(mod, parametric_effect_sizes = TRUE)
Generalized Additive Model with 7 terms
term type estimate se k edf statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) parametric 0.585 0.433 NA 1 1.35 0.178296
2 Sepal.Length parametric 0.0564 0.0736 NA 1 0.766 0.444891
3 Speciesversicolor parametric 0.725 0.548 NA 1 1.32 0.188196
4 Speciesvirginica parametric 1.08 0.625 NA 1 1.73 0.086286
5 Sepal.Length:Speciesversicolor parametric -0.0754 0.0994 NA 1 -0.758 0.449425
6 Sepal.Length:Speciesvirginica parametric -0.0764 0.101 NA 1 -0.753 0.452856
7 s(Petal.Length) TPRS NA NA 9 3.04 6.20 < 0.001
Is this something that might fit gratia
? If it is I can make a PR :)
EDIT: type error.
Reprex
library(tidyverse)
library(mgcv)
#> Indlæser krævet pakke: nlme
#>
#> Vedhæfter pakke: 'nlme'
#> Det følgende objekt er maskeret fra 'package:dplyr':
#>
#> collapse
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(gratia)
#>
#> Vedhæfter pakke: 'gratia'
#> Det følgende objekt er maskeret fra 'package:stringr':
#>
#> boundary
mod_overview <- function(
model,
parametric = TRUE,
parametric_effect_sizes = FALSE,
random_effects = TRUE,
dispersion = NULL,
frequentist = FALSE,
accuracy = 0.001,
stars = FALSE,
...
) {
smry <- model %>%
summary(
dispersion = dispersion,
re.test = random_effects,
freq = frequentist
)
nms <- c("term", "type", "estimate", "se", "k", "edf", "statistic", "p.value")
# smooth terms
types <- vapply(model$smooth, smooth_type, character(1))
dfs <- vapply(model$smooth, basis_size, double(1))
out <- smry$s.table %>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()
if (nrow(out) > 0) {
out <- out %>%
select(!matches("Ref.df")) %>%
rename(
statistic = 3
) %>%
add_column(
type = types,
Estimate = NA_real_,
`Std. Error` = NA_real_,
k = dfs,
.after = 1L
)
}
# parametric terms
para <- NULL
if (isTRUE(parametric) && !is.null(smry$p.table)) {
if (isFALSE(parametric_effect_sizes)) {
para <- as.data.frame(smry$pTerms.table) %>%
rownames_to_column() %>%
as_tibble()
if (nrow(para) > 0) {
para <- para %>%
rename(
edf = df,
statistic = 3,
) %>%
add_column(
type = "parametric",
Estimate = NA_real_,
`Std. Error` = NA_real_,
k = NA_real_,
.after = 1L
)
}
} else {
para <- smry$p.table %>%
as.data.frame() %>%
rownames_to_column() %>%
as_tibble()
if (nrow(para) > 0) {
para <- para %>%
rename(
statistic = 4,
`p-value` = 5
) %>%
add_column(
type = "parametric",
.after = 1L
) %>%
add_column(
k = NA_real_,
edf = 1,
.after = 4L
)
}
}
out <- bind_rows(para, out)
}
out <- set_names(out, nms)
if (isFALSE(parametric_effect_sizes)) out <- out %>%
select(!c(estimate, se))
if (stars) {
sstars <- out$p.value %>%
symnum(
corr = FALSE,
na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " ")
)
out <- out %>%
mutate(
# p = .data$p.value,
p.value = format.pval(.data$p.value, eps = accuracy),
stars = sstars
) # not sure why as.character(sstars) is wrong here "***"
attr(out, "legend") <- attr(sstars, "legend")
} else {
out <- out %>%
mutate(
p.value = format.pval(.data$p.value, eps = accuracy)
)
}
class(out) <- append(class(out), values = "overview", after = 0)
out
}
mod <- gam(Petal.Width ~ Sepal.Width*Species + s(Petal.Length), data = iris)
summary(mod)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> Petal.Width ~ Sepal.Width * Species + s(Petal.Length)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.52470 0.26301 1.995 0.047966 *
#> Sepal.Width 0.04917 0.06099 0.806 0.421477
#> Speciesversicolor 0.02786 0.34396 0.081 0.935569
#> Speciesvirginica 0.02175 0.37911 0.057 0.954330
#> Sepal.Width:Speciesversicolor 0.18676 0.10398 1.796 0.074605 .
#> Sepal.Width:Speciesvirginica 0.33826 0.09789 3.455 0.000726 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(Petal.Length) 2.778 3.588 5.583 0.000672 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.956 Deviance explained = 95.8%
#> GCV = 0.027261 Scale est. = 0.025666 n = 150
overview(mod)
#>
#> Generalized Additive Model with 4 terms
#>
#> term type k edf statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Sepal.Width parametric NA 1 0.650 0.421477
#> 2 Species parametric NA 2 0.00335 0.996658
#> 3 Sepal.Width:Species parametric NA 2 6.09 0.002894
#> 4 s(Petal.Length) TPRS 9 2.78 5.58 < 0.001
mod_overview(mod)
#>
#> Generalized Additive Model with 4 terms
#>
#> term type k edf statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Sepal.Width parametric NA 1 0.650 0.421477
#> 2 Species parametric NA 2 0.00335 0.996658
#> 3 Sepal.Width:Species parametric NA 2 6.09 0.002894
#> 4 s(Petal.Length) TPRS 9 2.78 5.58 < 0.001
mod_overview(mod, parametric_effect_sizes = T, stars=TRUE)
#>
#> Generalized Additive Model with 7 terms
#>
#> term type estimate se k edf statistic p.value stars
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <noq>
#> 1 (Intercept) para… 0.525 0.263 NA 1 2.00 0.0479… *
#> 2 Sepal.Width para… 0.0492 0.0610 NA 1 0.806 0.4214…
#> 3 Speciesversicolor para… 0.0279 0.344 NA 1 0.0810 0.9355…
#> 4 Speciesvirginica para… 0.0218 0.379 NA 1 0.0574 0.9543…
#> 5 Sepal.Width:Specie… para… 0.187 0.104 NA 1 1.80 0.0746… .
#> 6 Sepal.Width:Specie… para… 0.338 0.0979 NA 1 3.46 < 0.001 ***
#> 7 s(Petal.Length) TPRS NA NA 9 2.78 5.58 < 0.001 ***
#>
#> # 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2024-10-22 with reprex v2.1.0
I think
gratia
would be a nice home for an alternative GAM model summary table that would be a reasonable first pass for including in a paper or at least a supplement for common models, and be amenable to modification. Things that I think would be good to have:NA
where applicableI'm doing this for myself, and would be happy to start a PR based on it.