An issue with side constraints we should discuss

eric-pedersen commented 5 years ago

I've finished paper tweaks and the paper is almost ready. However, there's one possibly semi-big issue we should discuss as a group before we resubmit. This is something I had realized a while ago, and had chatted with Gavin about it when he was here, but we had thought that it shouldn't be enough of an issue to include in the paper. However, one of the reviewer's comments prompted me to rethink it, and realize that it's likely something we should include in the paper, so I wanted to bring it up for discussion.

The issue is with side-constraints. mgcv will drop terms from nested smooths until those smooths are no longer colinear with the main effects. This becomes a problem with tensor-product in the presence of a global smoother. Let's say we have a model set up like (using the CO2 data as an example):

modGS_te <- gam(log(uptake) ~ s(log(conc), k=5, m=2) + 
                   te(log(conc), Plant_uo, k=c(5,4),  bs=c("tp","re"), m=2),
                 data=CO2_new, method="REML")

For that model, the smoother for the global term would have 4 basis functions (as "tp" has the flat line in its null space, it has to drop one basis function to ensure that the smoother isn't colinear with the intercept). We'd expect the groupwise smooth term to have 20 basis functions (one for each group level for each of the marginal basis functions). However, gam.side will drop 5 basis functions from the groupwise term (one for each basis in the global term), resulting in each group level having a different complement of basis functions from one another. This means that the range of shapes of deviations differs between grouping levels.

To illustrate this, here's a figure for the basis functions (extracted using predict.gam) for four group levels for the model above (top row) compared to basis functions for the same term from a model without the global term (model S in our parlance):

missing_basis

This isn't an issue for models fit with bs="fs", as gam.side is disabled for "fs" terms, as they have no null space (so colinearity shouldn't be an issue). This is also not an issue for individual-level smoothers for models with separate smoothers (model GI), as gam.side treats them all as separate smoothers and does pairwise comparisons between sets of smoothers for colinearity, and since each grouping level isn't colinear with the global term, it doesn't drop any basis functions.

This means that for any model with both a global and a tensor-product groupwise term, both the shape of the global smooth and shape of individual curves will differ from what the equivalent "fs" (or by=group) model would give. To illustrate, here's the same four plants from CO2, fit with a "te" groupwise smoother with a global term (top row), a "fs" smoother with a global term (middle row) and a "by=" smoother with a global term (bottom row), plotting the global term, the plant-specific deviations, and the sum of the two for each model. global_plus_dev

The thing to note is that the global smooth for the "te" term differs quite substantially for the global term for the other models, as do the individual-level smooths (although the sums of those terms don't differ).

This would be straightforward to fix if it was possible for the user to disable gam.side for themselves on a term-by-term basis. However, this is currently not possible.

I think we need to be clear about this in text, and my apologies for not bringing it up earlier in the writing process. I don't think it invalidates what we've done here, it's just an extra caveat. I've drafted a branch to deal with this issue, where I've described the issue on page 40 - 41 of the compiled document (ignore the duplicated footnote for now; I'll fix that when I get a chance) (see branch side_constraint_issue). Let me know if this makes sense, and I can set up a pull request, as well as revise the letter to reviewers.

dill commented 5 years ago

@eric-pedersen can you post the code you used to generate those figures and how you subsetted the data? I'd like to do a wee experiment but want to make sure I'm comparing the same things :)

eric-pedersen commented 5 years ago

No problem! I should have shared that earlier. I just pushed a commit to the branch with the generating code in it (528446a).

gavinsimpson commented 5 years ago

Things seem to work much "better" if m = 1 on the te() smoother in modGS_te:

eric-plot

Is there another way to manipulate this model, say by using separate terms and ti()?

I'm thinking about this now but have to drop out for a meeting so wanted to post this in case you're also working on it.

eric-pedersen commented 5 years ago

m=1 improves it, but it still ends up dropping basis functions. I also don't think it's a general solution for the problem.

I've tried the ti() approach, but unfortunately it also doesn't work here. ti() works for standard interactions by first applying sum-to-zero constraints to marginal terms before building the interactions. However, as random effects marginals are already fully penalized, no sum-to-zero constraint is applied (if it was, it would have the effect of essentially treating one group-level chosen at random as the reference level, so set to zero everywhere, and then the smoother would be comparing each other group against that term). All ti will do in this case is drop the group-specific intercept for each group level.

As no sum-to-zero constraint is applied to the random effect, each individual basis function in the ti term will still co-linear with some combination of basis functions in the global smooth, and gam.side will still drop terms. fs deals with this by just preventing side-constraints from being applied, but there's no way I've found for the user to specify that for individual tensor products.

dill commented 5 years ago

Continuing to think about this but some info that might be useful:

When building a model term smoothCon calls off to the smooth.construct.*.smooth.spec methods to generate the design matrix/other stuff. Part of that "other stuff" is $side.constrain which is TRUE for almost all of the bases described in smooth.r in mgcv. The exceptions are (you guessed it) fs and re. So, it doesn't appear possible to turn off side constraints on a per-smooth basis (as you say @eric-pedersen).

Will continue thinking about how to fix or talk about this.

dill commented 5 years ago

I think I only changed to t2 in the below (previously had been mucking about with bs="ts" in the tensor to no avail... Need to figure out why but I need to go make dinner... screen shot 2019-01-21 at 17 38 47

eric-pedersen commented 5 years ago

Hhhmm. Interesting. Seems to work better, but the global smoother still differs in shape from the 'fs' smoother (which it shouldn't), and playing around with it, it still looks like basis functions are being dropped from the group-wise smoother even when using t2 terms.

gavinsimpson commented 5 years ago

With t2() and full = TRUE the global smoothers are pretty close; differences appear to be in the 3rd dp when I looked, but there are still differences in what is being estimated, with small differences in the estimated smoothness parameters etc. and side.constrain is TRUE in the t2() version.

dill commented 5 years ago

Okay some additional thoughts...

am I right in thinking that the te() approach becomes truly useful when there is >2 covariates to be smoothed in the term using a tensor product (x, y and fac, say)? If so, we can recommend that folks use the fs basis in the x,fac case?
If that is the case then perhaps we do need to think carefully about what happens in the higher-dimensional cases (which is a pain because that's harder to think about).
Not sure I fully follow your commen on fs having no nullspace so this not being a problem in your first post above. I'm not sure this is a nullspace issue, is about the "overlap" of the basis functions?
With that in mind, thinking about advice: "we recommend both careful choice of basis and setting model degrees of freedom so that groupwise terms are either slightly less flexible than the global term or have a smaller null space." (p. 41). I wonder if this is always a good idea. Perhaps there is a case for saying "I want a very simple general model and the group terms are wigglier forms of that base model"?
What about recommending that the ks for the global and group terms be different (need some criteria for this) to ensure that there is minimal overlap?
line 821 in the compiled paper, "concurved" should be "concurve" imo :)

(Sorry, this is more a stream of consciousness than a "solution"...)

eric-pedersen commented 5 years ago

am I right in thinking that the te() approach becomes truly useful when there is >2 covariates to be smoothed in the term using a tensor product (x, y and fac, say)? If so, we can recommend that folks use the fs basis in the x,fac case?

That's right, yes. I would say in general we should always recommend using bs="fs" for one-dimensional HGAMs. In fact, if you are working with 2D isotropic smoothers, you can still use bs="fs" and it works the same. For example, y~s(x1,x2, grp, bs="fs", xt=list(bs= "tp")) would work. It's only when using non-isotropic multidimensional smoothers with grouping that tensor products are needed (which I guess is something we should also mention in text).

If that is the case then perhaps we do need to think carefully about what happens in the higher-dimensional cases (which is a pain because that's harder to think about).

I agree with this; I used a 1D example only because it's easier to illustrate the missing basis functions. I can try and modify the example above to highlight the issue.

Not sure I fully follow your comment on fs having no nullspace so this not being a problem in your first post above. I'm not sure this is a nullspace issue, is about the "overlap" of the basis functions?

It's a combination of both. When you have a model like s(x, bs= "ts") + s(x,group, bs="fs",xt=list(bs="ts")), it's possible to fully reconstruct any value of the global smoother by appropriate choice of coefficients for the groupwise smoother, which means the model, in theory, could penalize the global smoother to a flat line (as it's using a thin-plate spline with a null space penalty). However, that would require mgcv setting the penalty very high for the global term, and low for the groupwise term; REML (from what I understand) prefers more balanced penalties, all else being equal. The easiest case to understand this effect in is with nested random effects. If you have an effect a:b nested in a, these terms are technically entirely colinear (model y~s(a,b, bs="re") will be able to fit the data just as well as model y~s(a, bs="re")+s(a,b, bs="re")) but the REML criteria will balance penalties so that the coefficients estimated for the nested term (s(a,b,bs="re")) will correspond to the sub-group deviations from the mean grouping levels s(a, bs="re"). Does that make sense?

With that in mind, thinking about advice: "we recommend both careful choice of basis and setting model degrees of freedom so that groupwise terms are either slightly less flexible than the global term or have a smaller null space." (p. 41). I wonder if this is always a good idea. Perhaps there is a case for saying "I want a very simple general model and the group terms are wigglier forms of that base model"?

That might actually make more sense... it would definitely minimize the problem of dropping basis functions. If the global smoother has 6 basis functions, and the groupwise term has 20 basis functions per level, I think you would still only drop a total of 6 basis functions from the groupwise term, so each groupwise term would be "closer" to being able to recreate the global smooth term.

What about recommending that the ks for the global and group terms be different (need some criteria for this) to ensure that there is minimal overlap?

There will still be dropped terms, even when k is different, but it might be good guidance to note that there should be higher k in the group-level terms regardless.

line 821 in the compiled paper, "concurved" should be "concurve" imo :)

:) good catch.

dill commented 5 years ago

I didn't know about the points you raise in 3. and was not aware of that feature of REML! Do you have a reference for the latter?

So perhaps a somewhat unpleasant thought is: can we trust the global smooth to recover the "true" global effect or are we always going to be subject to the whims of the relative basis configurations of the global vs. group terms? I think we are subject to the exact setup of the smoother, especially in cases with smaller numbers of basis functions (as the basis function set gets richer, missing a couple of functions out probably makes less difference? (that's entirely a reckon on my part)). The good thing is that the predictions are okay, it's just whether we can trust inference on the components that's the issue.

So, what about saying something like that?

PS. I assume that t2 looks better because of the separation of null/range space in the construction of the product?

gavinsimpson commented 5 years ago

Turning off the normal parameterization of the te() smooth with np = FALSE also seems to mitigate the effect; it's still dropping basis functions but the fits are much closer to the fs-based model now with:

modGS_te <- gam(log(uptake) ~ s(log(conc), k=5, m=2) + 
                   te(log(conc), Plant_uo, k=c(5,4),  bs=c("tp","re"), np = FALSE),
                 data=CO2_new, method="REML")

eric-plot

As with the t2() parameterization, it seems the results are sensitive to the choice of parameterization, but perhaps with a little testing on our examples we might be able to suggest one that works better? Eric, how do the various approaches reconstruct the known global smooth from the data generating model, which is GS (IIRC?)?

It would be good to make the point that Dave makes re informing the model via the anticipated amount of wiggliness in the global vs group smooths.

gavinsimpson commented 5 years ago

Once this is all over, it would be useful to talk to Simon about this, perhaps suggesting implementation of a way to turn off side constraints for a smooth.

eric-pedersen commented 5 years ago

I unfortunately don't have a reference for point 3; it's something I've mostly discovered from a lot of playing around computationally. At one point I thought that demonstrating that would be a good addition to the paper, as I haven't seen it spelled out anywhere, but it's not straight-forward to work out analytically from the REML criteria given in Wood 2017.

I think the question of "when do HGAMs converge on a true model" is really important, and not something we've determined. However, I think it's something that would have to be answered in a future paper; it would just add too much to this paper. I think all we can really warn is that these models can be sensitive to specific model parameterization. The functional regression literature has done a fair bit of work on this (through functional ANOVA), but I'll confess I find the functional regression literature to be very opaque; they just speak a different language than I do.

eric-pedersen commented 5 years ago

And @gavinsimpson: the data for CO2 aren't generated, they're from an experiment, so I'm not sure how well the different approaches reconstruct the "true" average curve. I'd have to play around with simulations to figure it out, and my guess is how well different parameterizations work to reconstruct a given true global smooth will depend heavily on the details of the shape of the true smooth and the group-level deviations.

I had emailed Simon about if it was possible to optionally disable side-constraints back when I first spotted it, in case he had an easy fix, but I never heard back from him, and forgot to follow up.

gavinsimpson commented 5 years ago

@eric-pedersen Sorry I wasn't clear; I meant for the bird movement example. We don't need the te() for the CO2 example but we do if we want an anistopic 2d smooth per species in the bird movement example. As I understand it, the true model there was a GS; if so, how well do the various options

te()
te(...., m = list(1, NA))
te(...., np = FALSE)
t2(...., full = TRUE)

recover that model?

eric-pedersen commented 5 years ago

That's a good idea; I'll give that a try tonight (in a meeting currently). I know how we can test it; look at mean deviation from true for the fitted curves, and generate plots of deviation of mean fitted global curve from the mean curve (if anyone wants to try it before I can, go for it! :smile: ).

eric-pedersen commented 5 years ago

Hi folks,

Sorry for the delay. Finally figured out how to plot these comparisons. I realized when doing this that I hadn't actually used a true global function for this; all of the bird migrations were generated by offsetting and shifting a single function (so there is an implied global function) but I had to calculate it from the true bird paths.

Anyway, here's the plot comparing your four model suggestions @gavinsimpson to the true global function:

capture

What is clear is that none of the models work well for 2D data except the t2() version (which is what bs="fs" is fitting for the one-dimensional case. And it looks like t2() generally recreates the global function pretty effectively! The biggest difference is that the global function estimated by t2() is shifted "up" a bit from the true global function (almost all values are on average a bit higher in the global function estimated in t2).

However, this leaves me thinking that this may be less of an issue than I figure thought for 2D examples; the t2() smoothers (which we recommend using in the paper) still seem to work well for reconstructing global functions.

gavinsimpson commented 5 years ago

That's a bit more reassuring. I would suggest that we mention the potential issue but that in some testing t2() formulation seemed to recover the true effect in the bird model, and then leave it at that, perhaps with a "but further study would be required to evaluate this claim fully". Or some such.

dill commented 5 years ago

+1

Definitely need to caveat and give some chat about how we really can't get to the "true" global effects (if they even exist).

eric-pedersen commented 5 years ago

Okay. Sounds like we have a consensus here that the best way forward is to note the caveat (as well as the issue of "what is a global function really?"). I can extend the text I've already included either tomorrow or this weekend. This should be our last edit then before we can submit:exclamation: :fireworks: :100:

dill commented 5 years ago

Yahoo!

On 24 Jan 2019, at 20:49, Eric Pedersen notifications@github.com wrote:

Okay. Sounds like we have a consensus here that the best way forward is to note the caveat (as well as the issue of "what is a global function really?"). I can extend the text I've already included either tomorrow or this weekend. This should be our last edit then before we can submit❗️ 🎆 💯

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

eric-pedersen / mixed-effect-gams

An issue with side constraints we should discuss #49