Figure out best random effect specification in GAM

ValentineHerr commented 4 years ago

I now think that the way I specified the random effect in the GAM (s(Year, bs = "re", by = treeID), following this webpage) is not actually doing what we want it to do... When I plot the prediction of core increment against Year for one particular individual, the only way I can have it look like a smooth spline is if I specify the gam as s(Year, by = treeID), which removes the random effect part (be = "re").

@crollinson used function gamm instead of gam, and has s(Year, by=PlotID) but also random=list(PlotID=~1, Canopy.Class=~1, TreeID=~1) which is probably the best way to do it but it is very computer intensive. I am running it at Scortty Creek right now but it is taking forever. If we go with it we will have to drop the "Sum of AIC weight" thing that we use to further thin the most significant variables.

As an alternative, I need to read and investigate more the paraPen argument of the gam function. It can apparently be used for random effects but not in a straightforward way.

ValentineHerr commented 4 years ago

This thread seems to suggest that we would like option 4: s(Year, treeID, bs = "fs")` I'll try that.

ValentineHerr commented 4 years ago

Yay!, s(Year, treeID, bs = "fs") works and it seems to be quite fast ( a few seconds - @crollinson's way is still running since I first talked about it here). If we trust the person (Gavin Simpson seems reliable) it might be our best option. @teixeirak, can you confirm that the answer for option 4 in the thread I mention above is what we want ? and we don't want to include a population-level effect, correct? otherwise it would be s(Year) + s(Year, treeID, bs = 'fs')

my only hesitation is that this page says :

This ['fs'] class produces a smooth for each level of a single factor variable. Within a gam formula this is done with something like s(x,fac,bs="fs"), which is almost equivalent to s(x,by=fac,id=1) (with the gam argument select=TRUE). The terms are fully penalized, with separate penalties on each null space component: for this reason they are not centred (no sum-to-zero constraint).

I am not sure what "null space components" means but I am concerned about id = 1

id is used to give smooths identities: smooths with the same identity have the same basis, penalty and smoothing parameter (but different coefficients, so they are different functions).

This makes me think about what @crollinson said: short series should have fewer knots than long series

but maybe the select = TRUE deals with that?:

the gam can add an extra penalty to each term so that it can be penalized to zero. This means that the smoothing parameter estimation that is part of fitting can completely remove terms from the model. If the corresponding smoothing parameter is estimated as zero then the extra penalty has no effect.

Overall I am not sure what it means ...

so I see 2 options:

trust the thread and go with s(Year, treeID, bs = 'fs')
try something like: s(Year, by=treeID, id=serie_length) with serie_length being a classification of each core based on their length, something like every 10 year we allow a different number of smoothing term... I don't know about select = TRUE since I don't understand it well.

I'll already try option 1 since it seems easy and since I am leaning towards trusting Gavin Simpson. We could also potentially ask him about or issue of different time series length.

ValentineHerr commented 4 years ago

(I might have been overoptimistic about speed of this new way... I was not comparing the same data set...)

crollinson commented 4 years ago

Hi @ValentineHerr a couple quick notes: 1) I absolutely trust Gavin Simpson. His code is what started me down this track and while I've only had limited interaction with him, he seems like a solid guy. I haven't played around with the "fs" or select/id statements, but will look into them.

One thing to also note is that in everything I've done, spline effects are 0-centered, so you probably need to make sure to add a fixed or random intercept for most things in the "by" term for the splines. I can't remember if you did this or not already, but I made that mistake recently.

The random effects does take forever and we had a heck of a time getting the model to converge. This was the sticking point with our last round of reviews. We eventually had to bite the bullet to split out individual species and it worked. If you're running on a normal laptop or even desktop, it might be prohibitively cumbersome. It was taking my iMac forever to fit the hierarchical effects structure, BUT if you have/can get access to an HPC or something along those lines it's not too bad -- my server was able to crank out the mixed models in a fraction of time as my desktop because I've optimized it for that (no GPU processing power, etc).

ValentineHerr commented 4 years ago

Thanks @crollinson for the feedback.

"fs" does seem to be faster (see code below) but right now both ways seem to take forever for Scotty Creek, I feel like I am crashing R, I must have mis-coded something in that other script. I'll get back to it with a fresh mind next week.

This is processing time for one species at BCI, with 51 different trees.

>   system.time(m_christy <- gamm(log_core_measurement ~ ns(pet, 2) + ns(wet, 2) 
+ s(Year, by = factor(treeID)), random=list(treeID=~1),
 data = Biol[Biol$species_code %in% sp, ]))
   user  system elapsed 
  17.08    0.70   17.78 
>   system.time(m_fs <- gam(log_core_measurement ~ ns(pet, 2) + ns(wet, 2) 
+ s(Year, factor(treeID), bs="fs"),
 data = Biol[Biol$species_code %in% sp, ]))
   user  system elapsed 
   7.75    0.00    7.75

When you say

to add a fixed or random intercept for most things in the "by" term for the splines

Do you mean that in my code above ns(pet, 2) + ns(wet, 2) should be more like ns(pet, 2, by = treeID) + ns(wet, 2, by = treeID) ?

Still thinking out loud here... Folowing the different options by Gavin Simpson here, maybe, if it works computationally , the ideal specification should be: gam(log_core_measurement ~ ns(pet, 2) + ns(wet, 2) + s(treeID, bs = 're') + s(Year, treeID, bs="fs") ? (ignoring any random slopes on climate variables).

teixeirak commented 4 years ago

@teixeirak, can you confirm that the answer for option 4 in the thread I mention above is what we want ? and we don't want to include a population-level effect, correct? otherwise it would be s(Year) + s(Year, treeID, bs = 'fs')

Yes, I believe this is what we want. To confirm my understanding, we're fitting individual spines on year for each tree, correct? That is what we want.

I don't see any current need for a population-level effect of year for this analysis. But, for future reference (for @biancaglez's project), could we also include year a non-spline effect of year, or year x climate interaction?

crollinson commented 4 years ago

@ValentineHerr sorry for the lack of clarity -- brain's a little slow and I'm migrating to a new computer. What I mean is that if you have something resembling gam(log_core_measurement ~ ns(pet, 2) + ns(wet, 2) + s(Year, factor(treeID), bs="fs") you may want to make it gam(log_core_measurement ~ ns(pet, 2) + ns(wet, 2) + s(Year, factor(treeID), bs="fs") + treeID

Adding + treeID will allow the trees to have mean different growth rates since unless the fs is a different kind of spline than the base, it forces the mean to be 0 so it would essentially be assuming everything has the same mean growth rate and just different temporal trends unless you add either fixed or random intercepts to account for this. Adding a fixed intercept will make it a lot noisier, but it will be faster.

Although now that I think about it more, this might be okay if we can be sure the DBH term is working as we want it to. I think the key difference is whether you're wanting to use the model for inference on your existing trees or whether you'll want to predict on trees that you won't have intercept estimates for.

ValentineHerr commented 4 years ago

could we also include year a non-spline effect of year, or year x climate interaction?

Yes, I think it is possible

teixeirak commented 4 years ago

Although now that I think about it more, this might be okay if we can be sure the DBH term is working as we want it to. I think the key difference is whether you're wanting to use the model for inference on your existing trees or whether you'll want to predict on trees that you won't have intercept estimates for.

I'd hope to be able to use it to predict for other trees. Part of the goal is to scale to forest productivity.

ValentineHerr commented 4 years ago

Thanks for the clarification @crollinson .

I think the key difference is whether you're wanting to use the model for inference on your existing trees or whether you'll want to predict on trees that you won't have intercept estimates for.

I think we want to infer to other trees...

ValentineHerr commented 4 years ago

to close this issue, I settled down with the following model formula (but I might have found something faster, that does not involve GAMs but gives very similar - cleaner? - results... see #29):

uGamm(log(core_measurement) ~ ns(wet, 2) + ns(tmx, 2)+ s(Year, by = treeID), random = list(treeID = ~1), data = x, na.action = "na.fail")

This solution allows one year spline to be fitted for each treeID and has a random intercept for each tree.

That is what the current (as of today) plots are representing (except for Scotty Creek).

teixeirak commented 4 years ago

Thanks! I updated the analysis methods document and will close this issue.

EcoClimLab / ForestGEO-tree-rings

Figure out best random effect specification in GAM #27