forc-db / Global_Productivity

Creative Commons Attribution 4.0 International

The choices of which ratios to use as dependent variables, and the methods for modeling them, seem somewhat problematic. #77

Closed hmullerlandau closed 4 years ago

hmullerlandau commented 4 years ago

I’m concerned about the appropriateness of the statistical procedures used for regressions involving response variables that are ratios or proportions, as well as an issue of redundancy in some of the analyses.

The analyses are currently done as linear regressions with untransformed response variables. That doesn’t really make sense: the fitted values of a linear regression can range from negative infinity to positive infinity, whereas observed proportions are bounded between 0 and 1, and observed ratios are bounded below by 0 (they range from 0 to infinity).

I think for ratios of one component to another (e.g., ANPPfoliage/ANPPwoodystem), the right approach would be to use the log of the ratio as the response variable. The log of these ratios ranges from negative infinity to positive infinity, and it is interpretable analogously to the log odds (as used in logistic regression).

For proportions, i.e., ratios of one component to a total (e.g., ANPP to NPP), one option would be to basically turn these into ratios of one component to the remainder, e.g., ANPP / (NPP-ANPP) and then take the log, so really make these ratios not proportions. Alternatively, as proportions they could be modeled with beta regression.
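The conversion from a proportion to a log ratio could be sketched as follows. The function names are hypothetical. The transform is the familiar logit, so it can be inverted to recover the proportion.

```python
import math

def proportion_to_log_ratio(part, total):
    """log(part / (total - part)), e.g. log(ANPP / (NPP - ANPP))."""
    if not 0 < part < total:
        raise ValueError("need 0 < part < total for a finite log ratio")
    return math.log(part / (total - part))

def log_ratio_to_proportion(x):
    """Inverse transform: recover part/total from the log ratio."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical site where ANPP is 60% of NPP.
lr = proportion_to_log_ratio(6.0, 10.0)
assert math.isclose(log_ratio_to_proportion(lr), 0.6)
```

A proportion of exactly 0 or 1 has no finite log ratio, so such sites would need separate handling.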

Then there is the question of which ratios and/or proportions it makes sense to include. Given that ANPP+BNPP=NPP, analyses of ANPP/BNPP, ANPP/NPP, and BNPP/NPP are not independent; they are fundamentally all about the question of allocation aboveground vs. belowground.
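To spell out the dependence, write $r$ for ANPP/BNPP (a symbol introduced here for illustration). Given ANPP + BNPP = NPP, the two proportions are deterministic functions of $r$, so all three quantities carry the same information when computed on the same sites:

```latex
r = \frac{\mathrm{ANPP}}{\mathrm{BNPP}}, \qquad
\frac{\mathrm{ANPP}}{\mathrm{NPP}} = \frac{r}{1 + r}, \qquad
\frac{\mathrm{BNPP}}{\mathrm{NPP}} = \frac{1}{1 + r},
\qquad
\log r
= \log\frac{\mathrm{ANPP}/\mathrm{NPP}}{1 - \mathrm{ANPP}/\mathrm{NPP}}
= -\log\frac{\mathrm{BNPP}/\mathrm{NPP}}{1 - \mathrm{BNPP}/\mathrm{NPP}}
```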

That said, some of these analyses are significant for certain variables and some aren’t, so what’s going on here? I think this is partly an issue of the wrong kind of analyses being done (see above), and perhaps partly an issue of which sites are included? In principle, you could calculate BNPP for every site that has ANPP and NPP, as BNPP=NPP-ANPP, but are you in fact doing that? If not then of course different sites are included in ANPP/NPP vs. BNPP/NPP, and that could be part of what is going on.

My recommendation in the case of ANPP, BNPP, and NPP would be that for the purposes of analyses of allocation, you should take every site that has 2 of these variables, and calculate the 3rd. So if a site has ANPP and NPP but not BNPP, calculate BNPP as NPP-ANPP. If a site has ANPP and BNPP but not NPP, then calculate NPP as ANPP+BNPP. Then analyze log(ANPP/BNPP) as the response variable. And don’t analyze ANPP/NPP or BNPP/NPP.
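A sketch of this recommendation, assuming each site record is reduced to three optional numbers. The function names are hypothetical and this is not code from the repository.

```python
import math

def complete_npp_budget(anpp=None, bnpp=None, npp=None):
    """Fill in the missing term of ANPP + BNPP = NPP.

    Returns (anpp, bnpp, npp), or None if fewer than two values are known.
    """
    known = sum(v is not None for v in (anpp, bnpp, npp))
    if known < 2:
        return None
    if npp is None:
        npp = anpp + bnpp
    elif bnpp is None:
        bnpp = npp - anpp
    elif anpp is None:
        anpp = npp - bnpp
    return anpp, bnpp, npp

def log_anpp_bnpp(anpp=None, bnpp=None, npp=None):
    """Response variable log(ANPP/BNPP), or None if it cannot be computed."""
    budget = complete_npp_budget(anpp, bnpp, npp)
    if budget is None:
        return None
    a, b, _ = budget
    if a <= 0 or b <= 0:
        return None  # the log ratio is undefined for non-positive components
    return math.log(a / b)
```

With this, a site reporting only ANPP and NPP and a site reporting only ANPP and BNPP both contribute to the same analysis.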

Similarly, for GPP:NPP, I would recommend estimating Rauto as GPP-NPP for all sites that report GPP and NPP but not Rauto, and analyzing log(NPP/Rauto) as the response variable. Or if you really want to analyze the relationship of GPP and NPP with each other, then do it in terms of CUE=NPP/GPP, and use beta regression. But that is more complicated.

For ANPPfoliage and ANPPwoodystem, I would recommend just analyzing log(ANPPfoliage/ANPPwoodystem) as the response variable, and dropping analyses of ANPPfoliage/NPP and ANPPstem/NPP. The latter two analyses mix allocation belowground vs. aboveground (treated in the analyses of ANPP and BNPP) with the partitioning of allocation aboveground.

This would lead to a hierarchical partitioning: NPP vs Rauto as components of GPP, ANPP vs. BNPP as components of NPP, and ANPPstem vs. ANPPfoliage as components of ANPP (although of course that misses reproductive allocation…). And all of these could be analyzed with response variables as log-transformed ratios.

teixeirak commented 4 years ago

Registering @bpbond's comment here:

Re working with ratios in linear models, a couple of potentially useful resources:

Lajeunesse, M. J.: Bias and correction for the log response ratio in ecological meta-analysis, doi:10.1890/14-2402.1, 2015

https://hansjoerg.me/2019/05/10/regression-modeling-with-proportion-data-part-1/

beckybanbury commented 4 years ago

Having looked at this a bit more closely, I think that part of the issue may be with how data is entered in ForC. Several sites have entries for multiple NPP and ANPP values, e.g. NPP_1 and NPP_2 or ANPP_1 and ANPP_2, for the same year and study, because of the way ForC uses specific definitions of components. Sites only have one BNPP_root value, because we don't have this same component method for BNPP_root. This means that for a given site, we might have one BNPP:NPP ratio, but four ANPP:NPP ratios. I don't think calculating a BNPP value based on each combination of ANPP or NPP would be a good idea here, but obviously it's also a problem re pseudoreplication of ANPP:NPP ratios. One solution might be to only use one definition of NPP/ANPP etc. in the calculations (though I think we did this initially and decided against it, see issue #52). Maybe a better approach would be to take an average ratio for each site?
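The site-averaging idea could look roughly like this, assuming the data can be flattened to (site, ANPP, NPP) rows with one row per ANPP/NPP combination. The row layout and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def site_mean_ratio(rows):
    """Average ANPP:NPP per site so each site contributes one value.

    rows: iterable of (site, anpp, npp) tuples.
    """
    ratios_by_site = defaultdict(list)
    for site, anpp, npp in rows:
        ratios_by_site[site].append(anpp / npp)
    return {site: fmean(ratios) for site, ratios in ratios_by_site.items()}

# Hypothetical rows: site "A" has two ANPP/NPP combinations, site "B" has one.
rows = [("A", 4.0, 10.0), ("A", 6.0, 10.0), ("B", 5.0, 10.0)]
means = site_mean_ratio(rows)
```

This removes the pseudoreplication, but it blends measurements that are defined differently, which is a trade-off to weigh against picking a single definition.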

teixeirak commented 4 years ago

Hmmm, that's tricky (and part of why this analysis isn't focal in the manuscript). We obviously need to choose just one type of ANPP/NPP measurement per site. Two options:

1- select the more common variant and use only that one
2- always go with the more inclusive measurement

I'll let you determine which works better.

hmullerlandau commented 4 years ago

If all the ANPP measurement methods are okay (are they?), then I would hate to see a site dropped just because it didn't have ANPP measured in the single preferred way. What about determining a ranking of ANPP methods, and then creating a variable named ANPP_best or something like that, which takes the value of the best ANPP method available for a given site?
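A sketch of the ANPP_best idea, assuming each site is a mapping from variable name to value. The ranking shown is purely illustrative, as the actual order of preference would need to be decided, and `best_anpp` is a hypothetical helper.

```python
# Illustrative ranking only, most preferred first.
ANPP_RANKING = ["ANPP_2", "ANPP_1"]

def best_anpp(site_record, ranking=ANPP_RANKING):
    """Return (variable_name, value) for the highest-ranked ANPP
    measurement present at a site, or None if the site has none."""
    for name in ranking:
        value = site_record.get(name)
        if value is not None:
            return name, value
    return None

# Hypothetical site that only reports ANPP_1: it is kept, not dropped.
site = {"ANPP_1": 7.2, "NPP_1": 12.0}
choice = best_anpp(site)
```

Returning the variable name alongside the value keeps a record of which definition was used at each site, so any bias from mixing definitions can be checked afterwards.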

teixeirak commented 4 years ago

They're defined based on what components they include. So, for example, one may include branch fall and the other not. I wouldn't say they're acceptable/unacceptable, just more or less inclusive/thorough. "Best" would be most thorough, but it may be biased in that it includes rarely measured components.

teixeirak commented 4 years ago

FYI, @hmullerlandau , we pooled them for the main analysis, after checking that it didn't introduce any bias.

beckybanbury commented 4 years ago

@teixeirak I have worked on this now, and calculated ANPP and BNPP as Helene suggested where possible (just refining graphs plotting log(ANPP/BNPP) against climate).

I just wanted to clarify about doing this for the other variables. Firstly, do you think it's okay to calculate R_auto from GPP + all variables of NPP? In the ForC_variables file, GPP = R_auto + NPP_5, but we don't have enough measures of NPP_5 to make this viable, so if we want to do any analysis on NPP:R_auto ratios we'd have to include all NPP measures.

Finally, what do you think about calculating ANPP_1 from ANPP_woody_stem and ANPP_foliage? I assume this is not how those variables combine, because of differences between how ANPP_foliage and ANPP_litterfall are defined.

I think what I am aiming for with this analysis now is probably just to calculate log of three ratios (ANPP:BNPP, foliage:stem, and ideally NPP:R_auto, but if not I'll look at doing something with NPP:GPP) - this makes most sense to me with looking at shifts in allocation as Helene says.

teixeirak commented 4 years ago

I just wanted to clarify about doing this for the other variables. Firstly, do you think it's okay to calculate R_auto from GPP + all variables of NPP? In the ForC_variables file, GPP = R_auto + NPP_5, but we don't have enough measures of NPP_5 to make this viable, so if we want to do any analysis on NPP:R_auto ratios we'd have to include all NPP measures.

Probably okay, although it assumes minimal decoupling of the two through storage of nonstructural carbohydrates. That said, I don't think it's necessary to get at this; I'd drop it.

teixeirak commented 4 years ago

Finally, what do you think about calculating ANPP_1 from ANPP_woody_stem and ANPP_foliage? I assume this is not how those variables combine, because of differences between how ANPP_foliage and ANPP_litterfall are defined.

Sure.

teixeirak commented 4 years ago

I think what I am aiming for with this analysis now is probably just to calculate log of three ratios (ANPP:BNPP, foliage:stem, and ideally NPP:R_auto, but if not I'll look at doing something with NPP:GPP) - this makes most sense to me with looking at shifts in allocation as Helene says.

Yes, I like that more concise list. Let's go with NPP:GPP.

beckybanbury commented 4 years ago

Do you mean to run just a linear model of NPP:GPP, or should we try a beta regression like Helene suggested?

FYI I have just saved figures of regressions using the log ratios here (nothing is significant).

teixeirak commented 4 years ago

Thanks! Your call on NPP:GPP.

beckybanbury commented 4 years ago

@teixeirak @hmullerlandau it seems very complicated to run beta regressions for mixed models, so I have decided against doing that, and have just stuck with doing the log of true ratios, as in the plot attached here. If you're unsure about NPP:R_auto, I could change this to log( NPP/(GPP - NPP)) (bearing in mind the current data I used includes GPP - NPP = R_auto) so that it is still a log of proportions, or we could just go back to doing a simple linear regression of NPP:GPP.

I haven't calculated ANPP stem or foliage from ANPP because I felt that was a bit ambiguous with the ForC definitions we're working with.

Do you feel that this addresses the problems sufficiently?

[Attached figure: "all"]

hmullerlandau commented 4 years ago

That sounds fine to me. Are the graphs going to be included in the paper? If so, it would be nice to have the y axis show the untransformed values on a log scale, for easier interpretation.

teixeirak commented 4 years ago

Ditto.

teixeirak commented 4 years ago

I think this is fully resolved.

beckybanbury commented 4 years ago

[Attached figure: "all"]

This is the figure with untransformed values. I don't really like removing outliers, but what do you think about the ANPP:BNPP ratios? There's a pretty huge outlier. I've checked the literature and it was recorded correctly, but it does look very off (removing outliers has no effect on significance).

teixeirak commented 4 years ago

Yes, go ahead and remove.