Comments on JOSS paper - Githubissues

(Part of review here.)

This was fun to read and I learnt some things :)

Some minor statistical/editorial comments on the paper:

The author name is shown as "¿citation_author?" in the pdf on the gratia github. Not sure if this is normal.
The formulation for the GAM (first display equation) doesn't allow for multivariate smooths. This is always a pain, and I've struggled to come up with a general form. In the end I used what Simon Wood does in his book and just give an example of what can be done than trying to write a general definition. My version of this is in Miller (2019), in case that's helpful. If you did adapt this bit, there would be some follow-on changes needed in the subsequent text (though I think just giving an example of a univariate spline in the following text is totally appropriate).
In addition you only talk about scale parameters as the other parameters of the distribution. That's fine for exponential family distributions, but for the extended exponential case (as you show with the twlss() model), there are more parameters to consider. You might not want to address this in the intro -- that's fine!
"which, with the default penalty, is the integrated squared second derivative" -- I don't think this is quite right. Do you mean the default basis-penalty combination, thin plate regression splines? (As you go on to talk about.)
Is Figure 1a the basis functions or the basis functions multiplied by their corresponding coefficients? I think it's the latter?
It would be nice to have a legend in Figure 1 so readers can relate the basis functions to the values in the penalty matrix (and e.g., see that the most wiggly function has the highest penalty)? Open to push-back on this, if you want to just give default plots from gratia.
"GAMs fitted by mgcv are an empirical Bayesian model with an improper multivariate normal prior on the basis function coeficients." Is that always the case? One might argue that the intercept is always in the nullspace of every penalty, but one could remove that from the model and have a proper MVN prior? Could just put the improper in brackets with the word "usually"?
In previous snippet, coefficients, not "coeficients"
Miller (2019) should be Miller (2021) -- I updated it on arXiv.
"The response is assumed to be conditionally distributed Tweedie" -> "The response is assumed to be conditionally Tweedie distributed"?
"Model coefficients and smoothing parameters are estimated using restricted maximum likelihood (Wood, 2011)" missing period at end of sentence?
Explanation of ctrl <- gam.control(nthreads = 10)?
"which show significant heteroscedasticity and departure from the condtional distribution of the response given the model" -- could highlight that we see this due to the increasing spread in the deviance residuals wrt values of the linear predictor in the top right plot?
In the previous snippet "condtional" -> "conditional"
"given the absence of important effects in the model" not sure what this means?
In the Tweedie LSS model lead-in, it looks like you're using \varphi rather than \phi for the scale parameter (as you did in the intro), not sure if this was intentional?
"little data wrangling, we can produce an uncertaintry estimate ,using fitted_samples() to" misplaces comma.
In the references:
- Not sure how the JOSS wants you to style "brms" but "Brms" looks weird.
- "Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in sobolev spaces" Sobolev should be capitalised.
- "Wood, S. N. (2003). Thin plate regression splines: Thin plate regression splines." I think it just has one title? ;)

The author name is shown as "¿citation_author?" in the pdf on the gratia github. Not sure if this is normal.

This seems to be OK in version generated by the review system itself (I think I have something wrong in the YAML meta data for the the citation_author element which is causing the problem there.) So I'll leave this just now (beyond tweaking the citation_author in the YAML) and figure it out if the issue presents later.

Further to our conversation, some thoughts on the improper prior issue:


# get some data
library(mgcv)
set.seed(2)
dat <- gamSim(1,n=400,dist="normal",scale=2)

# tprs has a null space dimension of 1 -- cool
gam(y~s(x0), data=dat)$smooth[[1]]$null.space.dim

# ts has a null space dimension of 0 -- also cool
gam(y~s(x0, bs="ts"), data=dat)$smooth[[1]]$null.space.dim

# what about B-splines with a continuous penalty? hmmm
gam(y~s(x0, bs="bs"), data=dat)$smooth[[1]]$null.space.dim

# now using code from the paper ;)
library(gratia)
library(dplyr)
library(ggplot2)

# basis get
bfs <- basis(gam(y~s(x0, bs="bs"), data=dat), select = "s(x0)")

 # plot basis functions and fitted spline
bfs |>
   ggplot(aes(x = x0, y = .value)) +
   geom_line(aes(colour = .bf)) +
   guides(colour = "none") +
   labs(x = "x", y = "y", title = "s(x)")

Well that gives us this:

Screenshot_2024-07-16_15-59-49

which I don't see any "linear" terms in... ¯_(ツ)_/¯

Going back to the boring situation in which we "read the documentation", in ?smooth.construct we have:

null.space.dim: The dimension of the penalty null space (before centering).

So I think this is before identifiability constraints are applied (since they need to be applied at the model, not term level. So I think the nullspace just consists of the intercept in that case.

Doing a tiny bit more digging:

jagam(y~s(x0, bs="bs"), data=dat, centred=FALSE, file="m1.jags")
jagam(y~s(x0, bs="bs")-1, data=dat, centred=FALSE, file="m0.jags")

Produces:

model {
  mu <- X %*% b ## expected response
  for (i in 1:n) { y[i] ~ dnorm(mu[i],tau) } ## response 
  scale <- 1/tau ## convert tau to standard GLM scale
  tau ~ dgamma(.05,.005) ## precision parameter prior 
  ## prior for s(x0)... 
  K1 <- S1[1:10,1:10] * lambda[1]  + S1[1:10,11:20] * lambda[2]
  b[1:10] ~ dmnorm(zero[1:10],K1) 
  ## smoothing parameter priors CHECK...
  for (i in 1:2) {
    lambda[i] ~ dgamma(.05,.005)
    rho[i] <- log(lambda[i])
  }
}

model {
  mu <- X %*% b ## expected response
  for (i in 1:n) { y[i] ~ dnorm(mu[i],tau) } ## response 
  scale <- 1/tau ## convert tau to standard GLM scale
  tau ~ dgamma(.05,.005) ## precision parameter prior 
  ## Parametric effect priors CHECK tau=1/79^2 is appropriate!
  for (i in 1:1) { b[i] ~ dnorm(0,0.00016) }
  ## prior for s(x0)... 
  K1 <- S1[1:10,1:10] * lambda[1]  + S1[1:10,11:20] * lambda[2]
  b[2:11] ~ dmnorm(zero[2:11],K1) 
  ## smoothing parameter priors CHECK...
  for (i in 1:2) {
    lambda[i] ~ dgamma(.05,.005)
    rho[i] <- log(lambda[i])
  }
}

So I think that makes sense? (Running to stand in front of my poster so I will think about this more in a wee bit!)

Going back to the boring situation in which we "read the documentation", in ?smooth.construct we have:

null.space.dim: The dimension of the penalty null space (before centering). So I think this is before identifiability constraints are applied (since they need to be applied at the model, not term level. So I think the nullspace just consists of the intercept in that case.

But for that to be true, wouldn't the null space dimension for a TPRS (the default) be = 2 because it has both constant and linear basis functions, yet as you show it is 1.

If we do

library("mgcv")
library("gratia")

df <- data_sim("gwf2", seed = 2)
m <- gam(y ~ s(x, bs = "cr"), data = df)

# extract penalty
S <- m$smooth[[1]]$S[[1]]

# eigen decompoisition
eigen(S)

producing

> eigen(S)
eigen() decomposition
$values
[1]  2.201890e+00  1.420716e+00  8.582862e-01  4.449231e-01  1.969655e-01
[6]  7.046850e-02  1.825633e-02  2.373200e-03 -6.021430e-17

where 1 eigenvalue is numerically 0 and the matrix is not invertible by normal means.

$> solve(S)
Error in solve.default(S) :
  system is computationally singular: reciprocal condition number = 4.66992e-19

And this is what I would expect, because a linear function is still in the span of basis, even if there is no actual linear function.

Now consider the same model but with select = TRUE, where we include an additional penalty on the null space:

m <- gam(y ~ s(x, bs = "cr"), data = df, select = TRUE)

Now the null space of this smooth is 0, as it should be because of the extra penalty:

> m$smooth[[1]]$null.space.dim
[1] 0

Some more discussion is needed here...

Some further reading has helped, if we're just talking about the Null space of a penalty. The key point I think is that the null space is the span of functions representable by the basis for which the penalty has no effect (and hence it's size is the number of such representable functions that are unaffected by the penalty). The null space doesn't refer to specific basis functions but to the functions representable by that basis.

Section 5.4.3 of Simon's GAM book has:

BB0B45E9-91FB-4F35-B194-73D0F06C1446

which I think is clear. This explains the findings in my comment above. I have assumed that the rank deficiency of $S$ is what leads to the improper prior when we view penalized smoothing from a Bayesian perspective. But I should read a little more about that...

Here's some clearer evidence that the smooths have a null space. This needs the very latest version of gratia from github but equivalent results can be achieved with mgcv::smoothCon(..., diagonal.penalty = TRUE, absorb.cons = TRUE).

We can rotate the basis functions using an eigendecomposition to reparameterize the smooth such that it has a diagonal (identity) penalty. The null space is then highlighted because the diagonal elements are zero for the null space.

remotes::install_github("gavinsimpson/gratia")
library("gratia")
library("patchwork")

df <- data_sim("gwf2", seed = 2)

p1 <- basis(s(x, bs = "bs"), data = df,
    constraints = TRUE, diagonalize = TRUE) |>
  draw()

p2 <- penalty(s(x, bs = "bs"), data = df,
    constraints = TRUE, diagonalize = TRUE) |>
  draw()

p1 + p2 + plot_layout(ncol = 2, nrow = 1)

penalty-null-space-fig

Here, we're using a B-spline basis and have absorbed the sum-to-zero identifiability constraint into the basis and then reparameterized to diagonalize the penalty. Absorbing the identifiability constraint means the constant / zero function is not in the span of the basis. But the linear functions remain in the span of the basis. This is clear from the reparametrisation shown above, where we see an actual linear basis function and a zero row/column in the penalty matrix.

The reparameterisation doesn't change anything with respect to the original basis, it just highlights the null space of the basis more clearly.

After some extensive chat here, online and offline (by nice coincidence at the same conference), we've got somewhere on this (the improper prior business)... I'll write this collectively for both our findings but if @gavinsimpson needs to correct me anywhere, I trust that will happen.

if we have a 2nd order derivative penalty, functions of the form intercept + slope are in the null space of the penalty
that doesn't mean that intercept + slope have to be explicitly in the basis definition, just that whenever the spline looks like a slope, it will not be penalized.
- you can observe this by taking a B-spline basis (which doesn't have a slope term) and fitting a model where you specify a numerically infinite smoothing parameter (sp=10e10, e.g.) -- we have a straight line, even though we don't have an explicit slope term
when we absorb the intercept for each term into the "global" intercept (which we must do for identifiability in the non-fully Bayesian, informative priors case) we remove that intercept from each term, BUT there still remains the slope null space
improper priors occur when we have non-finite variances (amongst other reasons), these occur when we have approximately zero penalties (since the covariance is the inverse of the penalty, large variance -> small penalty) which occurs when we have a rank deficient penalty -- when there is a null space.
there is a little subtlety about whether we talk about the prior for an individual smooth term vs. the multivariate normal prior for the whole model (taking a kind of Gaussian process view of the model, each bit is Gaussian, so the whole is one big Gaussian).
there are some limited cases where we have a proper prior on individual smooths, for example:
- when using the ts and cs bases (where there is a small penalty added to the null space terms)
- when using select=TRUE additional penalty matrices for the null space terms
- when using a lower-order penalty, with certain bases (tp, ps, bs, maybe others), where the slope is not in the null space and the intercept is removed with absorbing constraints.

Does that cover everything @gavinsimpson?

Then two additional comments because @gavinsimpson mistakenly asked me to look more closely at his equations...

Generally, I think folks use $\mathcal{L}$ to be the likelihood at define $\ell = \log_e\mathcal{L}$, so it might be "better" to write your $\ell$ as $\ell_p$ or somesuch?

(this list interrupted by my realising that github now renders $\LaTeX$, wow! That's going to make the next bit easier...)

on combining the $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}_j$ into $\boldsymbol{\beta}$: I think this is fine, you can say you do that and say that $\boldsymbol{\beta}_j$ refers to the appropriate parameters relating to term $j$ in the model, then I think that's clear enough.

@dill

The formulation for the GAM (first display equation) doesn't allow for multivariate smooths.

Fixed; Dave and I spoke about this while at ISEC this summer and we couldn't really think of a better way to represent the most common form of a GAM than tediously write out multiple smooth functions $f_j()$ in the equation. So, I have updated the manuscript to match Dave's suggestion and usage in his Bayesian interpretation of GAMs preprint.

...In addition you only talk about scale parameters as the other parameters of the distribution.

This was intentional; just wanted to talk about a standard GLM-like GAM. I have now written out the Tweedie distributional model in full later.

"GAMs fitted by mgcv are an empirical Bayesian model with an improper multivariate normal prior on the basis function coeficients." Is that always the case? One might argue that the intercept is always in the nullspace of every penalty, but one could remove that from the model and have a proper MVN prior? Could just put the improper in brackets with the word "usually"?

Done.

"which, with the default penalty, is the integrated squared second derivative" -- I don't think this is quite right.

Done.

Is Figure 1a the basis functions or the basis functions multiplied by their corresponding coefficients? I think it's the latter?

Yep, it's the latter. Fixed.

In previous snippet, coefficients, not "coeficients"

Fixed

Miller (2019) should be Miller (2021) -- I updated it on arXiv.

Done

"The response is assumed to be conditionally distributed Tweedie" -> "The response is assumed to be conditionally Tweedie distributed"?

Done

"Model coefficients and smoothing parameters are estimated using restricted maximum likelihood (Wood, 2011)" missing period at end of sentence?

Done

Explanation of ctrl <- gam.control(nthreads = 10)?

Have added a small explanation

"which show significant heteroscedasticity and departure from the condtional distribution of the response given the model" -- could highlight that we see this due to the increasing spread in the deviance residuals wrt values of the linear predictor in the top right plot?

Done

"given the absence of important effects in the model" not sure what this means?

I was trying to convey the point that given we don't have many potentially important covariates, the original model may be too inflexible to capture the variation that would be explained by the missing covariates if we had them. If have added

"which may be too inflexible if missing covariates mean that important effects are not included in the model"

to try to better explain what I mean.

Done

In the previous snippet "condtional" -> "conditional"

Done

In the Tweedie LSS model lead-in, it looks like you're using \varphi rather than \phi for the scale parameter (as you did in the intro), not sure if this was intentional?

It was intentional; I used $\phi$ for the GLM dispersion parameter, while I use $\varphi$ for the scale parameter of the Tweedie. I think the twlss() model still has a dispersion parameter but, like the Poisson, it is fixed to equal 1.

But, it was a little confusing. I think I have things right now; I still want to distinguish between the dispersion parameter (it just happens to be the Tweedie scale parameter in model m1) in general when I write the general GAM equation and the Tweedie scale parameter when I am talking about the Tweedie distribution and the distributional GAM, which is more specific.

"little data wrangling, we can produce an uncertaintry estimate ,using fitted_samples() to" misplaces comma.

Fixed

Miller (2019) should be Miller (2021) In the references: ... [mistakes in the bibliography]

I have now fixed the ones you found. I missed the brms one and the other things were weird entries in my ref manager metadata. I'll do another sweep once the reviews are concluded.

Generally, I think folks use... (from a more recent comment) on combining the...

Done and I think Done. Perhaps you could give this last part (the one about combining the $\gamma$s in with the $\beta$s

@dill Thanks all the comments and suggestions and importantly taking the time to talk through some of the prior / null space things we both touch on in the comments above.

I've addressed all your comments (see above) bar 1:

It would be nice to have a legend in Figure 1 so readers can relate the basis functions to the values in the penalty matrix (and e.g., see that the most wiggly function has the highest penalty)? Open to push-back on this, if you want to just give default plots from gratia

Done now. The basis function plot was not a plot that was generated by one of gratia's plots, the penalty plot was but the basis function one needed some level of customisation anyway, so I added labelled lines.

I'll push this and then GH Actions will render the pdf to check. Will need to be looking in the joss-paper branch.

gavinsimpson / gratia

Comments on JOSS paper #296