gavinsimpson / gratia

ggplot-based graphics and useful functions for GAMs fitted using the mgcv package
https://gavinsimpson.github.io/gratia/
Other
206 stars 28 forks source link

smooth_estimates by factor #217

Closed DHLocke closed 1 year ago

DHLocke commented 1 year ago
# regex: gratia::smooth_estimates with s(x, by = 'fac')

# first off: absolutely great software, thanks so much for creating and maintaining this package. It is really useful. 
# I appreciate all of the effort immensely. 

# second: I'm not sure if this is a problem per se or my inadequate knowledge of how smooth_estimates works with data_slice and mgcv::gam.
# possibly another function like fitted_values would be more appropriate, but I'm not sure.

# third: the reason the specified values of the two continuous covariates are important in my real, particular application
# is that I am fitting 500 gams with resampling. The underlying data have spatial, temporal, and spatio-temporal 
# autocorrelation, and there is a lot data. So each model contains just 2% of the data. In reality the rest of the model
# is more complex, but the reprex shows the central issue of wanting to extract smooths for two continuous predictors
# at specified values, while maintaining the 'by' part of the s(x, by = 'fac'). To combine the resampled parameters, they
# should be at the same values and levels to facilitate taking the median and 95% conf interval of the coefs, per by level.

# fourth, this *might* be related to #196 

# fifth: feedback on this regex is welcomed. New to this. ..

library(mgcv)
library(gratia)

# example based on https://gavinsimpson.github.io/gratia/articles/data-slices.html
# simulate data from the bivariate surface
df <- data_sim("eg2", n = 1000, scale = 0.25, seed = 2)

# add factor variable with two levels
df$g <- as.factor(sample(letters[1:2], 1000, replace = TRUE))

# double check
df |> head()
df |> plot()

# fit the GAM
m_biv <- gam(y ~ s(x, by = g) + s(z, by = g) + s(f), data = df, method = "REML") 
# in reality there is more going on the RHS, but not germane to this reprex

# look at all outputs
m_biv |> overview()                      # handsome, succinct table 
m_biv |> draw()                             # by s(x):ga is very similar to s(x):gb in this example, but non in real full example as determined by draw(model_from_2pct_data)     # draw's excellent defaults honor the 'by'
m_biv |> summary()                      # has smooth named "s(x):ga"
m_biv |> smooths()                       # has smooth named "s(x):gb", fantastic

smooth_estimates(m_biv) # great, predictions for x and z with 'by' honored. Very convenient output table for custom graphing, etc.

# BUT x and z are at some arbitrary values

# make a data slice so predictions can be at pre-determined combinations of x, z and for both 'by= ' levels
ds3 <- data_slice(m_biv,
                  x = seq(0, .5, .05),  # two continuous covariates, spaced out
                  z = seq(0, .5, .05),
                  g = c('a', 'b')           # and the binary factor used in the s(var, by= 'fac'). BOTH levels specified
                  )

ds3                              # examine data slice
df |> summary()         # looks like f in ds3 is at median level of original data in df, fantastic.
table(ds3$g)              # looks good
table(ds3$g, ds3$x) # as expected and intended

# try to extract the smooths at specifed x and z values while honoring the 'by'
sm <- smooth_estimates(m_biv, data = ds3)
sm <- smooth_estimates(m_biv, smooth = "s(x):ga", data = ds3)
sm <- smooth_estimates(m_biv, smooth = c("s(x):ga"), data = ds3)
sm <- smooth_estimates(m_biv, smooth = c("s(x)"), data = ds3)
sm <- smooth_estimates(m_biv, smooth = c("s(x)"), data = ds3, partial_match = TRUE)
sm <- smooth_estimates(m_biv, smooth = c("s(x):ga"), data = ds3, partial_match = TRUE)
#  Can't find by variable

smooth_estimates(m_biv, smooth = "s(f)", data = ds3) # works famously, yes!

PredictMat(m_biv, ds3)
# end
gavinsimpson commented 1 year ago

mgcv is very picky about data types; PredictMat() requires that the data is provided exactly as you had it in the data used to fit the model. Even though g is present in ds3 it is not a factor and that in and of itself is sufficient for it to throw the error you observed:

r$> sm <- smooth_estimates(m_biv, data = ds3)                                   
Error in PredictMat(smooth, data) : Can't find by variable
r$> ds4 <- ds3 |> mutate(g = factor(g))                                         
r$> sm <- smooth_estimates(m_biv, data = ds4)                                   
r$>

I should probably do more to check user supplied data.

A couple of observations:

  1. just in case this isn't a typo; your model is incorrect. For a factor by model you need to include the by variable as a factor parametric effect also:

    m_biv <- gam(y ~ g + # <----- very important! s(x, by = g) + s(z, by = g) + s(f), data = df, method = "REML")

  2. this issue has gone unnoticed because I have functions to help prepare data slices, like evenly() (which does work for factors), ref_level() and level(). So I would have created ds3 using

    ds3 <- data_slice(m_biv, x = evenly(x, lower = 0, upper = 0.5, by = 0.05), # two continuous covariates, spaced out z = evenly(z, lower = 0, upper = 0.5, by = 0.05), g = evenly(g) # and the binary factor used in the s(var, by= 'fac'). BOTH levels specified )

    which them works without error in smooth_estimates().

    I've been thinking that evenly() might not be the most intuitive name for something that applies to factors, so I should think about adding something specific for factors to return all the levels or a subset of levels of a factor from a GAM to complement ref_level() and level() which return the reference level or a specific level of a factor with all the levels of the original factor set as the level attribue.