Are we putting too many climate variables in the GAM?

teixeirak commented 4 years ago

@ValentineHerr, @crollinson, (and @cpiponiot and @rudeboybert, if you'd like to comment),

this figure shows some climate responses that don't make sense biologically for species with >3 climate variables in the model. The most pronounced example is LITU, showing higher growth at the lowest and highest precip and cloud cover values, and an exaggerated response to n wet days relative to the other species.

Are we allowing too many climate variables? Perhaps we should have stricter criteria for how many we put in the GAM?

cpiponiot commented 4 years ago

You have probably already checked this but do you have a graph showing the relationship between the climatic variables?

teixeirak commented 4 years ago

Yes, Valentine looks at the correlation among variables to eliminate variables that are too correlated before putting them into the GAM. She can give the details. I'm wondering if the criteria for inclusion should be stricter, though.

crollinson commented 4 years ago

A couple thoughts:

* Particularly with LITU, the confidence interval is quite large (which is probably good), so I suspect that despite the U-shape, it's not actually sensitive across that full range. In our paper we're defining "sensitivity" as the slope of the line (1st derivative) and whether or not it's significantly non-zero over different portions of the range.
* You might consider making the DBH spline more rigid. If this is being fit at the species level (rather than individual), I'm not sure the S-shaped curve for QURU(?) makes biological sense. I can come up with reasons for unimodal curves, but bimodal is trickier.
* I'm in the process of digesting the different variables and what pops as significant from the ClimWin for different species. I think there's some interesting ecology there, but I'm still puzzling through. Any chance there's a table of what is being used for which species?
* Finally, you might consider modifying the model that's being used for the residuals that go into climwin. If I'm remembering correctly, the residuals were based on a model with an individual-based year spline, correct? If so, what would happen if you did both the individual year spline and the species DBH spline so that it mirrors your non-climatic effects of the full models?
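
To make the "sensitivity" definition in the first point concrete, here is a minimal Python sketch (not the code used in this repo). It assumes simulated response curves are available, for example draws from a fitted smooth; the function name `sensitivity_mask` and the synthetic U-shaped curves are hypothetical and only illustrate the idea of testing whether the slope differs from zero over parts of the range.

```python
import numpy as np

def sensitivity_mask(x, curve_draws, level=0.95):
    """Mean slope of the response curve and where its interval excludes zero.

    x           : 1-D, strictly increasing grid of the climate variable
    curve_draws : array (n_draws, len(x)) of simulated response curves
                  (hypothetical input, e.g. draws from a fitted smooth)
    """
    slopes = np.gradient(curve_draws, x, axis=1)   # first derivative of each draw
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(slopes, alpha, axis=0)
    upper = np.quantile(slopes, 1.0 - alpha, axis=0)
    significant = (lower > 0) | (upper < 0)        # interval excludes zero
    return slopes.mean(axis=0), significant

# Synthetic U-shaped curves with uncertain tilt (not real data)
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 41)
curvature = rng.normal(0.5, 0.1, size=(500, 1))
tilt = rng.normal(0.0, 0.3, size=(500, 1))
curve_draws = curvature * x**2 + tilt * x

mean_slope, significant = sensitivity_mask(x, curve_draws)
```

In this toy case the curve is U-shaped, yet the slope is only distinguishable from zero toward the ends of the range, not near the middle.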

teixeirak commented 4 years ago

* You might consider making the DBH spline more rigid.  If this is being fit at the species level (rather than individual), I'm not sure the S-shaped curve for QURU(?) makes biological sense.  I can come up with reasons for unimodal curves, but bimodal is trickier.

@ValentineHerr, I'd second this.

teixeirak commented 4 years ago

* I'm in the process of digesting the different variables and what pops as significant from the ClimWin for different species.  I think there's some interesting ecology there, but I'm still puzzling through.  Any chance there's a table of what is being used for which species?

@crollinson, the same variables are used (as candidates) for each species. These are selected based on climwin output, and then Valentine removes correlated variables. I'll let her specify. In the figure, relationships are plotted only for those that come out as significant.

teixeirak commented 4 years ago

* Finally, you might consider modifying the model that's being used for the residuals that go into climwin. If I'm remembering correctly, the residuals were based on a model with an individual-based year spline, correct? If so, what would happen if you did both the individual year spline and the species DBH spline so that it mirrors your non-climatic effects of the full models?

@ValentineHerr, I think it would be good to keep track of what we get with a few different analysis options (like this). Once we have a few, we can select which seems to be best, and others may be candidate for mention in SI materials.

@crollinson, do you think it would make sense to convert to biomass increments for the climwin stage? Valentine and I were wondering about that yesterday.

crollinson commented 4 years ago

@teixeirak @ValentineHerr I think if we're ultimately interested in predicting biomass increment, it probably makes most sense to do that at the beginning and use it as our response variable in all of the stages (e.g. use it in climwin). I think we've seen that because multivariate climate sensitivities are tricky, it's probably best not to add potentially confounding effects of mis-matched responses to the mix.

teixeirak commented 4 years ago

@crollinson, makes sense. @ValentineHerr, I do think it's worth saving the results both ways (that is, for biomass increments and radial increments), but keeping the same response variable for climwin and the GAM.

ValentineHerr commented 4 years ago

Update: using a lower k for the DBH spline (k=3), as suggested by @crollinson.

For biomass increment, I need to spend more time on it because it seems that I have bad allometries for cato and fagr (they are really messing up the models). So I'll try to start using allo_db, which will take a while to transition to.

For now, this is what I have, excluding those 2 species (no species has pre in the best models). But I wouldn't start interpreting too much: this is a rough pass and I need to redo this more carefully.

crollinson commented 4 years ago

@ValentineHerr Regarding the AGBi: Yikes! I'm not saying you did something wrong, but that doesn't make ecological sense at all, particularly with cloudy days being the most consistent predictor. (Although maybe it does at this scale since cloud = light = photosynthesis? It's definitely counter to most past studies though.) The DBH curve makes total sense to me in this one.

Thinking "out loud" here: I think this really confirms my suspicion that the more we scale from ring width variability into units with square or cubic functions, the more it changes the distribution of the interannual variability and exacerbates the difference in the tails of the distribution, probably in a way that inflates "good" years and minimizes the difference between average and "bad" years. The standard tree-ring approach is to focus on the "bad" years because they tend to have the most consistent signal, which is why most standard methods have been really focused on that aspect of tree-ring patterns. If you're not doing it already, maybe saving a simple histogram of the residuals going into climwin would help diagnose what's going on?
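
A rough Python sketch of that diagnostic, for illustration only: it compares histograms and skewness of ring widths against the AGB increments derived from them. The allometry `AGB = a * DBH^b` with `a=0.1`, `b=2.5`, the fixed 10 cm DBH, and the lognormal ring widths are all hypothetical placeholders, not values from this analysis, and raw values stand in for the residuals that would actually go into climwin.

```python
import numpy as np

def skewness(values):
    """Sample skewness (third standardized moment)."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

def agb_increment(dbh_cm, ring_width_cm, a=0.1, b=2.5):
    """One-year biomass gain under a hypothetical allometry AGB = a * DBH^b."""
    return a * ((dbh_cm + 2 * ring_width_cm) ** b - dbh_cm**b)

# Synthetic ring widths in cm (not real data)
rng = np.random.default_rng(1)
ring_width = rng.lognormal(mean=np.log(0.2), sigma=0.4, size=5000)
agb_inc = agb_increment(dbh_cm=10.0, ring_width_cm=ring_width)

# Histograms that could be saved alongside each climwin run
rw_counts, rw_edges = np.histogram(ring_width, bins=20)
agb_counts, agb_edges = np.histogram(agb_inc, bins=20)

print("skewness, ring width:", skewness(ring_width))
print("skewness, AGB increment:", skewness(agb_inc))
```

Because the allometry is convex in ring width, the converted values have a longer right tail than the ring widths they came from; how much longer depends on tree size and the allometric exponent.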

teixeirak commented 4 years ago

I agree that I don't trust the biomass analysis yet (of course, neither does Valentine).

The biomass looks good for most species, but I'm suspicious of the QURU, QUVE, and JUNI allometries. Integrating allo-db will definitely be useful.

I wonder if the fact that AGB increment is so much more variable with tree size would introduce bias (because of systematic increases through time)? Also, does the detrending spline work as well on biomass?

Regarding cloudy days, I think it could make some sense. From the Helcoski analysis, cloudy days came out pretty important. I think they integrate moisture (more clouds --> more precip) and temperature (clouds buffer T extremes). They also affect the diffuse/direct radiation ratio, which eddy flux studies have shown matters.

teixeirak commented 4 years ago

@crollinson, here's the latest version based on biomass increment! Results are now roughly in line with what we get from biomass increment. Biomass allometries will still be improved.

teixeirak commented 4 years ago

The question remains whether we should tighten the criteria for inclusion of climate variables. The responses to cld don't make sense biologically and make me suspect that we need to tighten these criteria. @ValentineHerr, we should probably discuss this in person.

teixeirak commented 4 years ago

@ValentineHerr, I'm returning to this question (which has been bothering me) of whether/how to further limit the number of climate variables in the GAM. The cld responses do not make sense to me biologically, and they are also not consistent with the climwin results (see below). Potential solutions to come in the next comment.

teixeirak commented 4 years ago

Potential solutions (not in order of preference):

* Only include variables that come out as significant in the Monte Carlo analysis in climwin. We stopped doing this because it is computationally very time-consuming, right? (Would it be feasible to use the hydra cluster for that?) Besides the computational cost, the downside is that it's not really necessary, in that the main test of variable significance comes later.
* Alternatively, and by far easier, perhaps it would make sense to base variable selection on dAIC. From the file names, I'm noting that for SCBI: PRE > WET > CLD > PET > DTR > TMX > TMP > TMN. Remarkably (and reassuringly!), the dAIC ranking is the same regardless of whether we run this on AGB increment or core measurements. We could potentially limit candidate variables to those in the top 4-5 dAIC (and then remove overly correlated variables). Big downside for SCBI: this would eliminate the temperature variables. Interestingly, TMN has the lowest dAIC but a very consistent response across species.
* Another option could be to return to our earlier idea of grouping moisture and temperature variables, and then letting climwin pick the best. We could perhaps have it pick the best and then second-best, as recommended in the climwin publication for cases where you want to consider multiple climate variables.
* In addition to any of the above, I think it's useful to give more careful thought to which variables we really want to include, based on both biological mechanisms and how they are derived in the CRU database. That's a task for me.
* Another option, which I'm not so crazy about but which would be easy to justify, is simply to go with PRE and TMP, as is the typical convention.
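
For illustration, the dAIC-based option in the list above could look something like this in Python. This is a sketch, not existing code in the repo: `top_by_daic` is a hypothetical helper, it treats a larger dAIC as stronger support (the convention used in this thread), and the numbers are made up purely to reproduce the SCBI ordering quoted above.

```python
def top_by_daic(daic, n_keep=4):
    """Keep the n_keep variables with the largest dAIC (larger = stronger here)."""
    ranked = sorted(daic, key=daic.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical dAIC magnitudes; only their ORDER reflects the SCBI ranking
scbi_daic = {
    "PRE": 80, "WET": 70, "CLD": 60, "PET": 50,
    "DTR": 40, "TMX": 30, "TMP": 20, "TMN": 10,
}

candidates = top_by_daic(scbi_daic, n_keep=4)
print(candidates)
```

With this ordering and `n_keep=4`, none of TMX, TMP, or TMN survive, which is the downside noted for SCBI.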

@ValentineHerr, what do you think of those options?

teixeirak commented 4 years ago

Relationships among CRU variables, from Harris et al. 2020:

primary, secondary, and derived variables:

teixeirak commented 4 years ago

@ValentineHerr, based on the relationships among variables in the CRU database (above), I'd like to try the following:

For the climwin stage, group the variables as follows, selecting the best from each group:

* Precip group (describes precip and its frequency): PRE, WET
* TMP group (describes temperature): TMP, TMN, TMX, PET
* DTR group (describes variation in temperature, linked to cloud cover): DTR, CLD, PET

PET is in both the TMP and DTR groups. If it comes out as the best in both groups (should always be for the same time frame), then there are only 2 candidate variables for the GAM.

The 2-3 candidate variables for the GAM should then go through the same process of checking for collinearity. If there is redundancy, remove the one with the lower dAIC.
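
A small Python sketch of the group-wise selection described above, assuming one dAIC value per variable with larger meaning stronger. The function name, the group keys, and the dAIC numbers are hypothetical; TMN and TMX here stand for minimum and maximum temperature.

```python
# Variable groups as proposed above (PET deliberately appears in two groups)
GROUPS = {
    "precip": ["PRE", "WET"],
    "temperature": ["TMP", "TMN", "TMX", "PET"],
    "dtr": ["DTR", "CLD", "PET"],
}

def best_per_group(daic, groups=GROUPS):
    """Pick the variable with the largest dAIC in each group."""
    winners = {
        name: max(members, key=lambda v: daic[v])
        for name, members in groups.items()
    }
    # dict.fromkeys drops duplicates but keeps order, so PET counts once
    candidates = list(dict.fromkeys(winners.values()))
    return winners, candidates

# Hypothetical dAIC values in which PET is best in both of its groups
daic = {"PRE": 80, "WET": 70, "TMP": 20, "TMN": 10,
        "TMX": 30, "PET": 50, "DTR": 40, "CLD": 45}

winners, candidates = best_per_group(daic)
print(winners, candidates)
```

When PET wins both of its groups, only two candidates remain for the GAM; the collinearity check would then run on whatever `candidates` contains.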

ValentineHerr commented 4 years ago

This new way of doing it is currently running, but for the collinearity issue, I have the code remove any variable with VIF > 10 (a more standard approach than looking at AIC). Also FYI, I stopped having climwin check for linear relationships, for 4 reasons:

* it speeds up the process,
* quadratic is, I think, always the one with the lower AIC, and if it is not, there is probably not much difference in AIC and the curve would look like a straight line anyway,
* quadratic makes more biological sense,
* it would be too tricky to code the GAM to adapt to any circumstance.
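
For readers unfamiliar with the VIF filter, here is a minimal Python sketch of the idea (the actual code in the repo may differ). It uses the identity that the VIFs are the diagonal of the inverse correlation matrix. Dropping one variable at a time and recomputing is an assumption made here; the data are synthetic.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_obs x n_vars)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # VIF_j is the j-th diagonal element of the inverse correlation matrix
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Drop the variable with the largest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)   # remove the most collinear column
        names.pop(worst)
    return names

# Synthetic example: WET is almost a copy of PRE, TMP is independent
rng = np.random.default_rng(2)
pre = rng.normal(size=200)
wet = pre + rng.normal(scale=0.05, size=200)
tmp = rng.normal(size=200)
X = np.column_stack([pre, wet, tmp])

kept = drop_high_vif(X, ["PRE", "WET", "TMP"])
print(kept)
```

Only one of PRE and WET survives in this toy case, while TMP is kept.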

teixeirak commented 4 years ago

Great! I look forward to seeing the new results.

ValentineHerr commented 4 years ago

I pushed the new figures. Note: I still need to figure out what is going on with tmx at Harvard, I don't know why BCI shows up more now, and I removed the sum of AIC weights figures because there is a bug there. Otherwise most figures look like a subset of the ones we had before, with a few differences.

Also, I'll implement the new folder organization you were talking about when I code to run through the different types of analysis (still need to fix things in the dendro repo for this to happen)

teixeirak commented 4 years ago

Thanks! I'll give careful feedback ASAP.

It's good to see that results seem to be pretty stable across different methods of selecting variables, etc.

ValentineHerr commented 4 years ago

Pause. I did not realize but since I've changed the data source to pull it from the dendro repo, treeID is not considered as a factor in the GAM... I feel that I made the same mistake before... I should have realized it because it was running much faster. I am changing it now but that will take hours (days?) to run.

ValentineHerr commented 4 years ago

Good thing @crollinson asked to look at individual response curves! I don't think I would have noticed otherwise...

teixeirak commented 4 years ago

This seems to be resolved.

teixeirak commented 4 years ago

We have a case from SCBI where PET would have come out at the top of the T group, CLD at the top of the Cloud group (beating PET), and both were retained. Results look suspicious:

@ValentineHerr, I want to modify the variable retention method a bit. If PET is selected as the top variable in either the T or CLD group, but NOT as the top variable in the other group, then we want to compare the AIC of the top variable in those two groups and retain only the stronger one. In this case, we know that CLD beat PET in the cloud group, so PET would be removed. Hopefully doing so will make GLS results align with climwin (#38).

ValentineHerr commented 4 years ago

@teixeirak, do we agree that this is the equivalent of saying "if PET does not come out significant in both T and CLD group, drop it"?

teixeirak commented 4 years ago

Yes, that's a simpler way to put it! (But minor modification: "if PET does not come out as top variable in both T and CLD group, drop it".)
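
As a sanity check on that rule, a minimal Python sketch. The function name and group keys are hypothetical, "PRE" as the precip winner is a placeholder, and the sketch assumes that when PET is dropped no runner-up from its group is promoted in its place.

```python
def apply_pet_rule(winners):
    """Drop PET unless it is the top variable in both the T and cloud groups.

    winners: dict mapping group name -> top variable, e.g.
             {"precip": "PRE", "temperature": "PET", "dtr": "CLD"}
    """
    pet_in_both = (
        winners.get("temperature") == "PET" and winners.get("dtr") == "PET"
    )
    kept = []
    for var in winners.values():
        if var == "PET" and not pet_in_both:
            continue              # PET lost in one of its groups: drop it
        if var not in kept:
            kept.append(var)      # avoid listing PET twice when it wins both
    return kept

# The SCBI case described above: PET tops the T group, CLD tops the cloud group
print(apply_pet_rule({"precip": "PRE", "temperature": "PET", "dtr": "CLD"}))
# PET tops both groups, so it is retained once
print(apply_pet_rule({"precip": "PRE", "temperature": "PET", "dtr": "PET"}))
```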

ValentineHerr commented 4 years ago

right, that is more accurate, thanks!

EcoClimLab / ForestGEO-tree-rings

Are we putting too many climate variables in the GAM? #14