Comments from Gavin Simpson

In the GAM workshop there is an inconsistency between the Prezi and the HTML page with the R code, with the latter being potentially incorrect, in the section on Random Smooths Specifically, in the model

gamm_smooth <- gam(y ~ s(x0, fac, bs="fs", m=1), data=gam_data2)

you are missing s(x0)

gamm_smooth <- gam(y ~ s(x0) + s(x0, fac, bs="fs", m=1), data=gam_data2)

The idea being that s(x0) models the "population" level smooth effect of x0 on the response, and the random smooth basis, which are penalized in terms of the first derivative of the splines (m=1) represent deviations from the population level effect for fac-specific splines.

However, these bs = 'fs' splines are already fully penalized (they have penalties on the null space of the basis) so they can be shrunk back to straight lines or even null effects, hence

gamm_smooth <- gam(y ~ s(x0) + s(x0, fac, bs="fs"), data=gam_data2)

should work just fine too and is consistent with the example given in Simon's new edition of his GAM book. This formulation has the advantage of not worrying people about the order of the derivative penalty and given the way the penalties work will pull the fac-specific smooths back towards the population smooth in the same way as random slopes are shrunk towards the population slope in a mixed model.

With your formulation you are shrinking towards the mean (intercept) which is conceptually different to the previous examples.

Note slide 78 of the Prezi has this form of the model

gamm_smooth <- gam(y ~ s(x0) + s(x0, fac, bs="fs", m=1), data=gam_data2)

so regardless of intention there is an inconsistency between that and the HTML page.

The first is in the section Test for linearity. The problem with the approach used in the workshop is that it isn't quite correct to fit this model

gam(y_obs ~ x + s(x))

the basis expansion for s(x) contains the linear basis function which is confounded with the linear term we'll estimate for x as specified in that model. The linear basis function is in what is called the null space of the penalty. The wiggly basis functions are in what is known as the penalty range space. What we want to achieve is to separate out the basis functions in the penalty null space from the wiggly functions in the range space. Because the intercept and the linear term for x in the model are the two terms in the penalty null space for the thin plate spline basis implied by s(x), you do't want them in the basis expansion for x. The identifiability constraints (sum-to-zero) take care of the constant term in the basis expansion, but that leaves the linear basis function. We can remove that for thinplate splines by asking for a basis expansion without any null space at all, via the m argument to s(). Hence, a truly nested model isolating the nonlinearity in the smooth term and the constant (intercept) and linear parts as parametric terms we need to fit the model as

gam(y_obs ~ x + s(x, m = c(2, 0)))

where in m = c(2, 0) we are asking for a second order penalty with no penalty null space,. The second-order bit is the default indicating a penalty on the integrated squared second derivative. It is the 0 that ensures no null pace and hence a properly nested model. now you can use summary() on the second model and evaluate the Wald-like test for the smooth to perform your test of linearity. You can read more about this in 6.12.3 of the 2nd edition of Simon Wood's GAM book (pp. 312--313).

The second thing is that you should probably caution that when using anova() in multi-model mode (as the workshop does extensively for gam models), that the p-values are generally too low when you don't account for having estimated the smoothness parameters. When fitting via GCV, as your models all do, you can't correct for having estimated smoothness parameters, hence you need to interpret with caution the p values from the generalized LRTs performed by anova(). Alternatively you could switch to using ML or REML smoothness selection. This is discussed in detail in section 6.12.4 of the second edition of Simon Woo's GAM book (pp. 313--315).

Now that we have a better AIC that accounts for estimating smoothness parameters also, you may consider using it instead of or in addition to anova() for comparing model fits. I would also emphasise the summary() output for models as 1 this is well found theoretically now and does the right thing with models that include random effect terms (anova() fails spectacularly with random effect splines for example), and the Wald-like test are really ones of whether the estimated function is consistent with zero effect. The order of sections "other distributions" and "quick intro to GAMMs" is switched. The main recommendations are: (1) Given that the material is exhaustive, I'd recommend making clear in the protocol that running through the entire material should be less important than making sure everybody has understood what their code and can interpret the results. In our workshops we sought the feedback of participants in the second half of the workshop to decide which of the remaining topics we'll cover. (2) Since temporal autocorrelation is already a complex topic, it should not be treated at the beginning of the random effects / mixed models part of the workshop. (3) Model selection relies on ANOVA, which gives only approximate p values for smoothing functions and the author of the mgcv packages cautions against relying only on ANOVA for model selection. I recommend using AIC as the main measure of model fit and ANOVA only in addition (we did so in our workshop and it worked well). (4) I recommend to consider finding more tangible (i.e. real-life) datasets that can be used as examples, especially for the random effects and interactions part. The theoretical underpinnings of variable interactions and random effects. The script is quite clear but a real-life example instead of a hypothetical dataset could make it easier for participants to understand the concepts.

QCBSRworkshops / workshop08

Comments from Gavin Simpson #1