forc-db / Global_Productivity

Creative Commons Attribution 4.0 International

Second order polynomials don't really fit these relationships. Should logarithmic and/or asymptotic models be tested? #76

Closed by hmullerlandau 4 years ago

hmullerlandau commented 4 years ago

Polynomials almost never look like the right model here, even when they are supported over linear models. Usually it looks like the true relationship is asymptotic. Should we be trying other models in addition? That is, logarithmic or asymptotic models with the same number of parameters (2 and 3) as the models that are being fitted?

To check whether there is really a problem, I strongly recommend graphing residuals vs. the independent variable. If for polynomial fits these graphs often show a pattern of overprediction on the ends and underprediction in the middle (or the reverse), then that would support the idea that fundamentally second order polynomials aren’t capturing the underlying relationships.
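A minimal sketch of the residual check suggested above, in plain Python rather than the project's own code. The data are synthetic (a saturating curve invented for illustration, not ForC data), and all function names are hypothetical. It fits a second order polynomial by least squares, orders the residuals by the independent variable, and counts how often they change sign. A few long runs of same-signed residuals point to a systematic pattern; many short runs look more like noise.

```python
import math

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients ordered from the constant term upward.
    """
    k = degree + 1
    ata = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    aty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(ata, aty)

def residuals(x, y, coefs):
    """Observed minus fitted values."""
    return [yi - sum(c * xi ** p for p, c in enumerate(coefs))
            for xi, yi in zip(x, y)]

def count_sign_changes(x, res):
    """Number of sign changes in the residuals once ordered by x."""
    ordered = [res[i] for i in sorted(range(len(x)), key=lambda i: x[i])]
    signs = [r > 0 for r in ordered if r != 0.0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

# Synthetic saturating relationship (assumption, for illustration only).
x = [float(i) for i in range(1, 31)]
y = [10.0 * (1.0 - math.exp(-0.3 * xi)) for xi in x]

coefs = fit_polynomial(x, y, 2)
res = residuals(x, y, coefs)
sign_changes = count_sign_changes(x, res)
print("sign changes in ordered residuals:", sign_changes, "of", len(x) - 1, "possible")
```

Plotting `res` against `x` remains the more informative check; the sign-change count is only a quick numeric summary of the same pattern.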

In this case, I suggest trying asymptotic and logarithmic models instead.

Logarithmic models: just do the linear regression against log(x) instead of against x.

Asymptotic models: there now appear to be several simple ways to fit asymptotic models in R with 3 parameters (the same number as a 2nd order polynomial regression). Typically the fitted model includes one parameter that is essentially the asymptote, and two parameters that together determine how quickly the response approaches the asymptote and over what range of the independent variable it levels off. Here’s one implementation that looks fairly straightforward: https://astrostatistics.psu.edu/su07/R/html/stats/html/SSasymp.html More info on these and other alternatives: https://www.statforbiology.com/nonlinearregression/usefulequations
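The logarithmic option described above amounts to an ordinary linear regression with log(x) as the predictor. The sketch below illustrates that on synthetic data and compares it with a plain linear fit. The helper names are hypothetical, and AIC is used here only as an example criterion; the thread does not say which criterion the project's code uses.

```python
import math

def fit_line(u, y):
    """Ordinary least squares for y = intercept + slope * u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    slope = sxy / sxx
    return mean_y - slope * mean_u, slope

def rss(u, y, intercept, slope):
    """Residual sum of squares of a fitted line."""
    return sum((yi - (intercept + slope * ui)) ** 2 for ui, yi in zip(u, y))

def aic(rss_value, n, n_coef):
    """Gaussian AIC up to an additive constant (assumed criterion)."""
    return n * math.log(rss_value / n) + 2 * n_coef

# Synthetic data that follow a logarithmic trend plus a small wiggle
# (assumption, for illustration only).
x = [float(i) for i in range(1, 31)]
y = [2.0 + 3.0 * math.log(xi) + 0.05 * math.sin(xi) for xi in x]

# Linear model: regress y on x.
lin_b0, lin_b1 = fit_line(x, y)

# Logarithmic model: the same regression, with log(x) as the predictor.
log_x = [math.log(xi) for xi in x]
log_b0, log_b1 = fit_line(log_x, y)

aic_linear = aic(rss(x, y, lin_b0, lin_b1), len(x), 2)
aic_log = aic(rss(log_x, y, log_b0, log_b1), len(x), 2)
print("AIC linear:", aic_linear)
print("AIC logarithmic:", aic_log)
```

Both models have two regression coefficients, so the comparison does not penalize either for complexity. A 3-parameter asymptotic fit would need nonlinear least squares and is left out of this sketch.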

To be clear, I think ideally if we are going to test models other than just simple linear models (2 parameters), then we should consider not just 2nd order polynomials as alternative (3 parameters), but also a logarithmic model (2 parameters) and an asymptotic model (3 parameters), which seem more realistic for many biological processes and don’t involve any more parameters than a 2nd order polynomial.

Another alternative would just be to drop the 2nd order polynomial fits. They almost never look like they are getting the shape of the relationship right. The simplest approach that could easily be implemented without digging into methods for fitting asymptotic functions would be to simply compare linear and logarithmic relationships (so replace polynomials with logarithmic in the code). Ideally asymptotic models would be tested too, but probably logarithmic models would capture things almost as well much of the time.

teixeirak commented 4 years ago

This will affect Fig. 4 and S4-S7, with minor implications for the text/interpretation.

Assuming it can be done fairly quickly, I think it would be good to compare the fit of logarithmic models to polynomials, as the difference between peaked and asymptotic relationships is biologically interesting/significant.

beckybanbury commented 4 years ago

For log transforming data that are <= 0 (e.g. some of our MAT values), what is the best approach: remove these data, or shift them to make them > 0 (e.g. log(x + 10))?

teixeirak commented 4 years ago

Hmmm... I hadn't thought of that. I think the best approach would be to transform it, although of course the choice of offset can really influence the shape of the relationship (e.g., x+10 would give a very different result than converting to Kelvin). Let's try x+10 (or whatever is needed to make all values positive), and at the same time hope that the logarithmic model isn't the best fit for variables that need transforming. Perhaps @hmullerlandau has a comment here.

hmullerlandau commented 4 years ago

I guess on first principles one could argue for using temperature in Kelvin rather than Celsius. Because this means we basically end up with a range of 270 to 310, log temperature in kelvin is almost linearly related to temperature in Celsius, so we would expect almost exactly the same explanatory power as linear temperature. Of course, log(MAT in kelvin) is still slightly curvilinear relative to MAT, so it would still be a different model...
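The near-linearity claim can be checked numerically. The sketch below assumes a hypothetical MAT range of -10 to 30 degrees Celsius. It computes the Pearson correlation between MAT and log(MAT in Kelvin), and for contrast the correlation between MAT and log(MAT + 11), where 11 is an illustrative offset that keeps every value positive.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical MAT values in degrees Celsius (assumed range).
mat_celsius = [float(t) for t in range(-10, 31)]

# Log of absolute temperature: a large offset, so very little curvature.
log_kelvin = [math.log(t + 273.15) for t in mat_celsius]

# Log with a small offset: the curvature is much stronger.
log_offset = [math.log(t + 11.0) for t in mat_celsius]

r_kelvin = pearson(mat_celsius, log_kelvin)
r_offset = pearson(mat_celsius, log_offset)
print("corr(MAT, log(MAT in K)):", r_kelvin)
print("corr(MAT, log(MAT + 11)):", r_offset)
```

The closer the correlation is to 1, the less a log model can differ from a linear one over that range, which is why the size of the offset matters so much.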

Of course, for plant growth, it's really temperatures above freezing that matter, so the Kelvin zero seems too far removed.

Looks like the minimum MAT in the data is around -10. So, yes, log(MAT+10) or similar seems like a reasonable approach. Of course this should be applied to all site MAT values, not just the negative ones.

Are there any other independent variables that take value zero? The offset should be determined on a case by case basis.
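A minimal sketch of an offset log transform that follows the two points above: the same offset is applied to every value, and the offset is passed in explicitly per variable rather than guessed. The function name and the MAT values are hypothetical. The check guards against an offset that leaves any value at or below zero, which is what an offset of exactly 10 would do if the minimum MAT were exactly -10.

```python
import math

def log_with_offset(values, offset):
    """Return log(v + offset) for every value, using one shared offset.

    Raises ValueError if the offset does not make all values positive.
    """
    shifted = [v + offset for v in values]
    if min(shifted) <= 0:
        raise ValueError(
            "offset %r leaves a value <= 0 (minimum shifted value %r)"
            % (offset, min(shifted))
        )
    return [math.log(s) for s in shifted]

# Hypothetical site MAT values in degrees Celsius.
mat = [-10.0, -2.5, 0.0, 12.0, 26.0]

log_mat = log_with_offset(mat, 11.0)
print(log_mat)
```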

teixeirak commented 4 years ago

I agree with everything Helene said.

beckybanbury commented 4 years ago

Only MAT has values <= 0. I applied log(MAT + 11); this log model is the best model only for BNPP root. Updated graphs, with the log model comparisons included, have been pushed here (the grid_plots files), and the CSV output file is here.

teixeirak commented 4 years ago

Thanks, @beckybanbury, this looks good. It looks like adding the logarithmic option provides at least some meaningful improvement, so we should definitely keep this.

One small bug: please remember to back-transform the MAT values when a log fit is applied (e.g., upper left panel in grid_plots_climate2.png).
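A sketch of the back-transformation, assuming the log(MAT + 11) transform mentioned earlier in the thread. The fitted coefficients and the grid are hypothetical, and this is not the project's plotting code. Predictions are computed on the transformed scale, then the x values are converted back to degrees Celsius so the axis shows MAT rather than log(MAT + 11).

```python
import math

OFFSET = 11.0  # offset used in the thread: log(MAT + 11)

def back_transform(log_values, offset=OFFSET):
    """Invert z = log(MAT + offset), giving MAT = exp(z) - offset."""
    return [math.exp(z) - offset for z in log_values]

# Evenly spaced values on the transformed scale (hypothetical grid).
z_grid = [i * 0.25 for i in range(15)]  # 0.0 to 3.5

# Hypothetical fitted coefficients of y = intercept + slope * z.
intercept, slope = 1.0, 0.8
predicted = [intercept + slope * z for z in z_grid]

# x values for plotting, back on the original MAT scale.
mat_grid = back_transform(z_grid)

for mat_value, y_hat in zip(mat_grid, predicted):
    print(round(mat_value, 2), round(y_hat, 3))
```

Plotting `predicted` against `mat_grid` gives a curved fit line on an untransformed MAT axis.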