SCBI-ForestGEO / McGregor_climate-sensitivity-variation

Repository for linking the climate sensitivity of tree growth (derived from cores) to functional traits

statistical analyses #94

Closed teixeirak closed 4 years ago

teixeirak commented 4 years ago

R1 suggests alternate analysis approach:

"Statistical analysis: the analysis would be more precise if the authors used Generalized linear mixed models to avoid data transformation and take into account (with random effects, as they do for the linear mixed model, LMM, presented now) the influence of species in the analysis. Then I missed a further discussion on the role of individual species (e.g. see my comment below on Rt>1)."

"Only assessing models with AIC is not enough to prove sound relationships in multidimensional and complex data-models. LL ratio tests or deviance-anova tests can be used in the case of nested models with LMM or GLMM respectively. "

@mcgregorian1 , @ValentineHerr , let's discuss this.
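For reference, this is the kind of thing R1 seems to be describing: a minimal sketch using lme4, with a hypothetical response and predictors (`resilience`, `height`, `ring_porosity`, species `sp`, data frame `cores`); the actual model structure in this repo may differ.

```r
library(lme4)

# current approach: LMM on a transformed response (hypothetical formula)
m_lmm  <- lmer(log(resilience) ~ height + ring_porosity + (1 | sp), data = cores)

# reviewer's suggestion: GLMM that avoids the transformation,
# e.g. a Gamma family with log link for positive, skewed growth data
m_glmm <- glmer(resilience ~ height + ring_porosity + (1 | sp),
                family = Gamma(link = "log"), data = cores)

# likelihood-ratio / deviance test between nested models, in addition to AIC
m_glmm_red <- update(m_glmm, . ~ . - ring_porosity)
anova(m_glmm_red, m_glmm)
```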

mcgregorian1 commented 4 years ago

A quick point: when I was first reading up on AIC and BIC, I remember that because AIC has been used so much, many scientists are saying (as R1 does) that it's just not enough anymore. I've finished a class in Bayesian methods, and there we usually reported two statistics (dAIC plus a separate criterion) to be safe. My takeaway is that any future study I do should report something more than AIC alone.
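For what it's worth, base R already provides a second criterion alongside AIC for any fitted model object (hypothetical model `m` here):

```r
AIC(m)      # Akaike information criterion
BIC(m)      # Bayesian information criterion, a useful second statistic to report
logLik(m)   # log-likelihood, the basis for likelihood-ratio tests between nested models
```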

mcgregorian1 commented 4 years ago

Coming back to this again, I agree that using GLMMs would be good, but this will involve some code rearranging. I think it ultimately makes more sense to adapt the analysis to GLMMs first, and then do #93 and #92, since those results feed into the statistical method.

teixeirak commented 4 years ago

I agree, noting that only #92 requires attention.

mcgregorian1 commented 4 years ago

I am trying GLMMs.

Multicollinearity

  1. I realized (with Monika's help) that I had never made a correlation plot of the fixed effects. After converting ring porosity and crown position to numeric, I ran the code and got the following result:
    [correlation plot of the candidate fixed effects]

You'll notice dbh ~ height has a correlation of 1. This is accounted for when we determine the best models, since I specifically disregard any top model that includes dbh (we decided this in a previous discussion).

However, notice that dbh ~ position and height ~ position each have a correlation coefficient of 0.73, which is close to the 0.8 cutoff. Keep this in mind for the collinearity section below (a sketch of the check follows).
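Here's roughly what that check looks like, assuming a data frame `traits` holding the candidate fixed effects with ring porosity and crown position already converted to numeric (column names are hypothetical):

```r
# pairwise correlations among candidate fixed effects
cors <- cor(traits[, c("dbh", "height", "position", "ring_porosity")],
            use = "pairwise.complete.obs")
round(cors, 2)

# flag any pair at or above the 0.8 cutoff mentioned above
which(abs(cors) >= 0.8 & upper.tri(cors), arr.ind = TRUE)
```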

Running the GLMMs

  1. The candidate variables (tableS3_candidate_traits) did not change, only their distributions.
  2. The overall best models for each scenario (all drought years plus each individual year) did not change. However, two things:
    • For tableS4_top_models_dAIC.csv, I remember a couple of the reviewers asked why we set the AIC threshold to 2 when the original loop used 1. I will change that.
    • Now, though, this is my question (and probably a question for @ValentineHerr). Technically, there's literature supporting the idea that anything within 1 dAIC is essentially an "equivalent" model. So, for example, see the table below (and the selection sketch after it): we automatically discard the last 2 models, but we're left with 4 models that are the "same," which means the full model (row name 60) can be considered the "best" even though it has a dAIC of 0.73. Thoughts? I realized this while talking to Monika about her red panda research (she's also using GLMMs).
    • Also notice that for x1999, the dAIC for the second-best model is 0.02.

[table of candidate models with dAIC values per scenario]
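To make the dAIC question concrete, here is a sketch of the selection step, assuming a named list `mods` of fitted candidate GLMMs and the MuMIn package; the 1-unit cutoff is the one discussed above.

```r
library(MuMIn)

sel <- model.sel(mods)   # ranks candidate models by AICc
sel$delta                # dAICc relative to the top-ranked model

# models within 1 dAICc are treated as essentially equivalent, so the full
# model could still be reported as "best" even if its dAICc is, say, 0.73
subset(sel, delta < 1)
```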

Collinearity

From the correlation plot above: if we remove position (for example) due to its high correlation with height (0.73), then technically the current top model is already the best model.

I also ran variance inflation factors on just the top models as they currently are in tableS4_top_models_dAIC.csv (counting only dAIC == 0); a sketch of the check follows. The results are in tables_figures/top_models_dAIC_VIF.csv, but suffice it to say that the highest value we have is 1.3, which means everything is fine (nothing is highly correlated).
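For reference, a sketch of the VIF check: since `car::vif()` works directly on ordinary (g)lm fits, one simple route is to refit just the fixed-effects part of a top model (hypothetical formula and data).

```r
library(car)

# refit the fixed-effects part of a top model to inspect collinearity
fixed_fit <- lm(resilience ~ height + position + ring_porosity, data = cores)
vif(fixed_fit)   # values near 1 (like the 1.3 above) indicate little collinearity
```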

teixeirak commented 4 years ago

@mcgregorian1, thanks for looking into this.

Regarding canopy position, the collinearity with tree size is definitely an issue, not just statistically but also biologically. Given that canopy exposure is one of the major hypotheses of interest, I wouldn't want to drop it completely from the analysis. Rather, it's something that we need to (and do) address in the discussion (here). I'd keep it as is.

teixeirak commented 4 years ago

It looks like results are similar with GLMMs, right? Any changes to the outcome?

mcgregorian1 commented 4 years ago

No, the overall outcome (which variables are included in the top models) remains the same. What can change is our approach to choosing the "best" model from those top variables (what I was talking about regarding the <1 dAIC).

mcgregorian1 commented 4 years ago

@teixeirak what are your thoughts on my question about deciding what the "best" model is? This is the updated table with everything <1 dAIC (to be consistent with how we did this earlier).

In addition, I did the likelihood ratio test (which is suggested online as running anova over the nested models together).

I think we have a good approach here.

  1. We did AICc instead of normal AIC (see here and other sites).
  2. Doing the anova for the models within each group (by scenario), we see a different ordering than what AICc gives us.

Neither of these fixes the issue of how to compare models that are almost equivalent, but we could report both orderings and say that, based on these, we chose the top 4 models (one for each scenario) to be _____. What do you think? (A sketch of both checks follows the table below.)

[updated table of top models with dAIC < 1 and likelihood-ratio test ordering]
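A sketch of the two checks listed above, with hypothetical nested GLMMs `m_reduced` and `m_full`:

```r
library(MuMIn)

# small-sample corrected AIC (AICc) instead of plain AIC
AICc(m_full)
AICc(m_reduced)

# deviance / likelihood-ratio test (anova) for the nested pair
anova(m_reduced, m_full)
```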

teixeirak commented 4 years ago

I think this is all taken care of... closing.

teixeirak commented 4 years ago

Reopening this because we still need to deal with significance tests for individual variables, issue #99.