EcoClimLab / ForestGEO-tree-rings

Repository for analysis of tree-ring data from 10 globally distributed forests (Anderson-Teixeira et al., in press, Global Change Biology)

How to protect against overfitting? #43

Closed hmullerlandau closed 4 years ago

hmullerlandau commented 4 years ago

I'm concerned that testing many different climate variables, and fitting flexible models, increases the risk of overfitting. What about trying a cross-validation procedure to avoid this? Reserve some proportion of the data (perhaps 10% of the years and 10% of the individuals) as an evaluation dataset, fit the models on the rest of the data, predict values for the evaluation dataset, and evaluate the model in terms of the fit to the evaluation data. Repeat 100 times.
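
A minimal sketch of that scheme in R (hypothetical column names `tree_id`, `year`, `ring_width`; `fit_fun` stands in for whatever model is fit in the main analysis, so this is illustration only):

```r
# One cross-validation replicate: hold out ~10% of years and ~10% of trees,
# fit on the remainder, and score predictions on the held-out rows.
cv_once <- function(d, fit_fun) {
  hold_years <- sample(unique(d$year),    ceiling(0.1 * length(unique(d$year))))
  hold_trees <- sample(unique(d$tree_id), ceiling(0.1 * length(unique(d$tree_id))))
  test  <- d[d$year %in% hold_years | d$tree_id %in% hold_trees, ]
  train <- d[!(d$year %in% hold_years | d$tree_id %in% hold_trees), ]
  fit   <- fit_fun(train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$ring_width - pred)^2))   # out-of-sample RMSE
}

# Repeat 100 times and inspect the distribution of out-of-sample error:
# set.seed(1); rmse <- replicate(100, cv_once(dat, fit_fun))
```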

Just did a quick ref search and found the following refs that might be useful:

https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881
https://besjournals.onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00170.x
https://www.tandfonline.com/doi/abs/10.1080/11956860.2000.11682622

teixeirak commented 4 years ago

Thanks, @hmullerlandau!

This might be tough because of computation time. Earlier runs were taking hours per site. It's improved now, but I'm not sure by how much. I'll let @ValentineHerr comment.

ValentineHerr commented 4 years ago

Overfitting

I am not too concerned about overfitting anymore because we have greatly reduced the number of variables we try in the gls model. We consider all variables at first, in climwin, one by one, to find out 1/ what time window is best to average each of them over, and 2/ which 2 or 3 variables (from 3 categories of variables) we should carry into our multivariate analysis (gls). So we never test more than 3 climate variables + dbh at once. We do have a pretty flexible model (2-knot splines), but I think it is biologically relevant, as most organisms should have an optimum climatic niche. Of those 3 or 4 variables, we only keep the ones that have a sum of AICc weights >90, looking at a set of models that gives an even chance to each variable. That does drop some variables for some species.
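
Roughly, the two steps look something like the sketch below; all column names, window settings, and model details here are placeholders rather than our exact code:

```r
library(climwin)  # sliding-window climate signal search
library(nlme)     # gls()
library(splines)  # ns() natural splines

## Step 1: climwin, one candidate climate variable at a time, to find the best
## averaging window (placeholder data frames `clim` and `rw`).
win <- slidingwin(
  xvar      = list(temp = clim$temp),
  cdate     = clim$date, bdate = rw$date,
  baseline  = lm(log_ring_width ~ 1, data = rw),
  cinterval = "month", range = c(12, 0),
  type      = "absolute", refday = c(31, 12),
  stat      = "mean", func = "lin"
)

## Step 2: gls with at most 2-3 selected climate variables + dbh; ns(x, df = 3)
## gives a natural spline with two interior knots (a "2-knot spline").
fit <- gls(
  log_ring_width ~ ns(temp_window, df = 3) + ns(precip_window, df = 3) + dbh,
  data = rw_windows, method = "ML"
)
```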

We might be concerned about overfitting again when we try interactions (but again, we should only try the ones that would make biological sense).

Transferability

We do want to assess transferability, though. And maybe the framework I'll try to come up with could be used for model selection (as opposed to the AICc weights I was mentioning above). But using it for model selection would mean that our focus is more about making predictions than about describing mechanisms; @teixeirak, I'm not sure that is our case. Cross-validation should be feasible, but it might be tricky because we have rules like "species with fewer than x individuals are dropped" or "cores with fewer than 30 years are dropped" that could prevent models from running for some subsamples.

teixeirak commented 4 years ago

I think it will be more appropriate to get into the transferability issue with @rudeboybert's project (developing ecological forecasts for tree growth). If it can be kept quite simple (and easy to implement), we could do it here, but our focus is definitely on describing mechanisms, and frankly this analysis is already so involved (and already tough to describe within journal length limits) that I'm hesitant to add more.

ValentineHerr commented 4 years ago

I agree. I'll keep this in mind, though, if I have time left after we are done with everything else (which I doubt).

rudeboybert commented 4 years ago

I will actually be talking about spatial cross-validation in my presentation at Friday's SCBI group meeting. I will keep this discussion in mind. But a couple of early thoughts:

  • @hmullerlandau I'm thrilled to see you referenced Roberts. It was a linchpin of a PLOS ONE paper my collaborator David Allen and I published earlier this year (Dave is a PI at the BigWoods MI ForestGEO site). In particular, I found the image below very edifying.
  • Indeed, as @teixeirak pointed out, the run-time is one con of our approach.
  • @ValentineHerr Forgive me, as I may be addressing a separate issue here, but we encountered issues with cross-validation in the presence of rare species. The way we got around this was by making predictions based on a very simple Bayesian linear regression model. That way, small-sample ("data-poor") species could lean on a prior distribution for posterior estimates.

[screenshot]

hmullerlandau commented 4 years ago

The fact that many variables are tried initially raises the risk of overfitting, even if only models with 3-4 variables are tested later. If someone tests 100 variables, about 5 will be significant at p=0.05 by chance alone. Even if subsequent analyses focus only on those 5, that doesn't change the fact that they were arrived at by considering a much wider range of variables and models; indeed, in many ways that makes it worse, as it obscures the proper interpretation of the resulting p-values.

Valentine, what do you mean by "Of those 3 or 4 variables, we only keep the ones that have sum of AICc weights >90"? Akaike weights are between 0 and 1, so do you mean 0.90 (i.e., 90%)? And what do you mean by weights for individual variables? Akaike weights are calculated for individual models, which include multiple variables, and variables appear in multiple models, so how do you arrive at weights for variables?

Note also that Akaike weights are fundamentally dependent on the set of models being considered. That is, they are arrived at in terms of the likelihood of one model divided by the total likelihood of all models being considered. If fewer models are being considered, then the Akaike weight of any given model will be higher. If more models are being considered, then the Akaike weights will all be lower. So the definition of the set of models that is being considered is critical. What is the set of models exactly?
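
For reference, the standard formula makes that dependence explicit, since the denominator runs over the R candidate models actually being compared:

$$
w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{R} \exp(-\Delta_j/2)}, \qquad \Delta_i = \mathrm{AICc}_i - \mathrm{AICc}_{\min}.
$$

Adding or removing candidate models changes every weight through the denominator.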

ValentineHerr commented 4 years ago

The set of models is all possible additive models given the set of variables we test. So if we have variables A, B, and C, it will be:

Y ~ 1
Y ~ A
Y ~ B
Y ~ C
Y ~ A + B
Y ~ B + C
Y ~ A + C
Y ~ A + B + C

For each variable, we sum the AICc weights of the models it appears in.
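
Conceptually, something like the sketch below (placeholder variable names and a plain lm/gls call; not necessarily how it is implemented in the repo):

```r
library(MuMIn)
options(na.action = "na.fail")  # required by dredge()

# Global additive model with the candidate variables (placeholder names A, B, C);
# dredge() then fits all 2^3 = 8 additive subsets, matching the list above.
global  <- lm(Y ~ A + B + C, data = d)
subsets <- dredge(global, rank = "AICc")

# Per-variable sum of AICc weights over the models each term appears in
sw(subsets)
```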

About your first paragraph: we are not looking at p-values and are not saying "variable A is significant". We rank each variable using AIC and keep the ones that lower the AIC the most within each variable group, to decide, e.g., whether we should keep minT, maxT, or meanT in the next step (which is the one involving AICc weights). We know this is not ideal, and that, e.g., some variables might become important only in the presence of others (so should not be ruled out in a univariate analysis), but we just can't test everything (especially with obvious collinearity issues), so we have to reduce the set of potential variables somehow. This first step also helps us decide which months each variable should be averaged over.
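
That screening step, for one variable group, amounts to something like this (placeholder names again; in practice it is tied to the climwin output described above):

```r
library(MuMIn)   # for AICc()

# Compare univariate models within one variable group and keep the one with
# the lowest AICc (placeholder variable names minT, maxT, meanT).
candidates <- c("minT", "maxT", "meanT")
fits <- lapply(candidates, function(v)
  lm(reformulate(v, response = "Y"), data = d))
names(fits) <- candidates

sort(sapply(fits, AICc))   # lowest AICc = variable carried into the gls step
```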

Does that address your concerns?

hmullerlandau commented 4 years ago

Thanks for clarifying the approach.

How did you decide on the threshold of 0.9 for the AICc weights?

hmullerlandau commented 4 years ago

@rudeboybert - that PLOS ONE paper looks very interesting! And it definitely looks like an excellent way to avoid overfitting. Nice that you took advantage of analytical solutions to make it all computationally feasible.

rudeboybert commented 4 years ago

@hmullerlandau thanks for the kind words. And yes, if we had to resort to MCMC for posterior estimates instead of analytic values via matrix multiplication, it would've made an already computationally intensive procedure even more so!
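
For anyone following along, a bare-bones illustration of the kind of closed-form posterior being referred to; this is just the generic conjugate-normal result (with the residual variance treated as known), not the actual model from the paper:

```r
# Conjugate Bayesian linear regression: with a Gaussian prior on the coefficients
# and a (here assumed known) residual variance, the posterior mean and covariance
# come from matrix algebra alone -- no MCMC required.
bayes_lm <- function(X, y, sigma2, prior_mean = rep(0, ncol(X)),
                     prior_cov = diag(10, ncol(X))) {
  S0_inv <- solve(prior_cov)
  Sn <- solve(S0_inv + crossprod(X) / sigma2)                       # posterior covariance
  mn <- Sn %*% (S0_inv %*% prior_mean + crossprod(X, y) / sigma2)   # posterior mean
  list(mean = drop(mn), cov = Sn)
}

# With few observations (a data-poor species) the posterior mean stays close to
# the prior mean; with many observations it approaches the least-squares estimate.
```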

ValentineHerr commented 4 years ago

> Thanks for clarifying the approach.
>
> How did you decide on the threshold of 0.9 for the AICc weights?

To be honest, I don't remember; like a lot of other thresholds, it is a bit subjective. I think we had it at 0.95 at some point, and we might go back to that to be more conservative. I don't think I saw anything consistent in the literature. We are open to suggestions if you are familiar with this. It is only one line of code, so it is very easy to change.

ValentineHerr commented 4 years ago

We output a plot that shows the sum of AICc weights for each variable. It looks like they are often either very high or quite low, so I don't think the results are very sensitive to the threshold. I could investigate more when I have the time.

hmullerlandau commented 4 years ago

I asked because I don't remember having come across this type of sum of Akaike weights as a criterion for choosing which variables stay in a model. I just did a quick search, and the first hits were the following articles, which critique this as a criterion:

https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12251
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12835

ValentineHerr commented 4 years ago

Thanks @hmullerlandau, I didn't know these papers, but I am aware of the misconceptions associated with the sum of AICc weights, and I don't think we are using it in the ways that are criticized. We will be extra cautious in describing our methods and interpretations, or we will implement alternative solutions altogether if time allows.

teixeirak commented 4 years ago

I think we've dealt with this adequately.