JenniNiku / gllvm

Generalized Linear Latent Variable Models
https://jenniniku.github.io/gllvm/

concurrent ordination: some predictors in full rank vs. all in latent variables #96

Closed. dougwyu closed this issue 1 year ago.

dougwyu commented 1 year ago

I've been doing a little experimentation with concurrent ordination. I have, say, 4 predictors, with X_1 being the one I'm interested in scientifically, while the others are nuisance variables. An example would be X_1 being a management treatment and X_2,3,4 being local environmental measurements like nearby landcover, and so on.

mod1: gllvm(y, X, num.lv.c = 2, formula = ~ X_1, lv.formula = ~ X_2 + X_3 + X_4, studyDesign = Site, row.eff = ~ (1 | Site))

mod2: gllvm(y, X, num.lv.c = 2, lv.formula = ~ X_1 + X_2 + X_3 + X_4, studyDesign = Site, row.eff = ~ (1 | Site))

With mod1, I get coefficients for X_1 and force X_2,3,4 plus the latent variables to model the residuals. With mod2, I force X_1,2,3,4 plus the latent variables to model the whole dataset. So I understand why I would put all X in lv.formula, but when I ran mod2, I got some slightly strange results:

  1. When I run coefplot(), I get two coef plots per X (i.e., eight plots total for X_1,2,3,4), and the two plots for each predictor are different. Is this expected? How do I read these?

  2. I experimented and ran a slightly simpler mod3 (because I thought X_3 might be measured badly): fit <- gllvm(y, X, num.lv.c = 2, lv.formula = ~ X_1 + X_2 + X_4, studyDesign = Site, row.eff = ~ (1 | Site)). In this case X_1,2,4 all became non-significant (via fit[["Coef.tableLV"]]), whereas in mod2 X_1,2,3,4 had all been significant. I suspect there's a collinearity effect that X_3 somehow prevented, but I don't have a good explanation.

Thus, my interpretation is that the "risk" of putting X_1 in formula is that I am forcing the model to attribute significant effects to X_1 when the effects might be caused in part by other environmental covariates (measured or unmeasured). On the other hand, sometimes I have good reason to believe that X_2,3,4 are not the true drivers, so I want to attribute significant effects to X_1 if any exist.

Is this basically right? Wrong? Thanks in advance.

BertvanderVeen commented 1 year ago

Thanks for your question, Doug.

I am afraid I cannot reproduce issue 1). For example,

library(gllvm)
data(spider)
mod <- gllvm(spider$abund, X = spider$x, formula = ~ herb.layer, lv.formula = ~ soil.dry + fallen.leaves + bare.sand + moss, family = "poisson", num.lv.c = 2)
coefplot(mod)

does not produce the behavior you describe: it produces one "coefplot" per predictor, as expected. Can you provide a reproducible example? Note that your example model above includes the lv.formula argument twice, which would throw an error in R ("formal argument 'lv.formula' matched by multiple actual arguments").

With respect to 2), I would personally be careful about basing model selection on the appearance or disappearance of significant effects, and on what might or might not be "true drivers". Because all model terms are estimated simultaneously, a change in model structure can change parameter estimates and their standard errors in unpredictable ways, and predicting why or how they change is very difficult (especially in the presence of collinearity between predictors). The difference between model 1) above and model 2) is m - 2 parameters, with m being the number of species, so model 1) is quite a bit more flexible than model 2), and that extra flexibility sits in the "X_1" term.

In an attempt to explain the above: if "X_1" is important for the model fit and it is omitted, the model might try to compensate by modifying other terms; here, that would have to be the predictor terms on the latent-variable scale. That does not necessarily mean that "X_1" or any of the other predictors are important or unimportant; it is "just" the model trying to find a set of parameter estimates that gives as good a fit to the data as possible (which, for the record, might still be a terrible fit).
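
To make the parameter-count difference concrete, here is a minimal sketch with the spider example from above. It assumes the fitted object stores the species-specific coefficients in params$Xcoef (the reduced-rank slopes in params$LvXcoef appear later in this thread); the object names are only for illustration.

library(gllvm)
data(spider)
y <- spider$abund
X <- spider$x

# analogue of mod1: herb.layer gets one coefficient per species via 'formula',
# the other predictors enter through the constrained latent variables
mod_a <- gllvm(y, X = X, formula = ~ herb.layer,
               lv.formula = ~ soil.dry + fallen.leaves + bare.sand + moss,
               family = "poisson", num.lv.c = 2)

# analogue of mod2: all predictors on the latent-variable scale
mod_b <- gllvm(y, X = X,
               lv.formula = ~ herb.layer + soil.dry + fallen.leaves + bare.sand + moss,
               family = "poisson", num.lv.c = 2)

dim(mod_a$params$Xcoef)   # m species-specific slopes for herb.layer
dim(mod_b$params$LvXcoef) # one slope per predictor per constrained latent variable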

Selecting an appropriate model for inference by significance will likely lead to p-hacking, so I would advise finding a different method for determining the best model. Significance should not come into play until you have decided which model is "best" or "appropriate", which incorporates many things, including your expectations and feel for the system, and also e.g. checking residual diagnostics for assumption violations.
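
For the residual-diagnostics part, a minimal sketch using the spider model (mod) fitted above:

# Dunn-Smyth residual diagnostic plots via the plot() method for gllvm objects
plot(mod)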

I am not sure I would include predictors on the latent-variable scale just because "they might not be the true drivers". After all, if they are not the true drivers they should probably not be in the model, and otherwise the modelling procedure should confirm or reject whether they are the effects that drive the process the data were generated from. This might require deeper thought than I am capable of today, but it might be a case where you want to treat the latent variable predictor effects as random effects with the randomB argument.
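
A minimal sketch of that random-slope alternative, again with the spider data from the example above (treat the particular randomB value shown here as an assumption; see ?gllvm for the available options):

# concurrent ordination with random, rather than fixed, reduced-rank slopes
mod_rb <- gllvm(y, X = X,
                lv.formula = ~ herb.layer + soil.dry + fallen.leaves + bare.sand + moss,
                family = "poisson", num.lv.c = 2, randomB = "LV")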

dougwyu commented 1 year ago

Thanks for that, Bert.

Just to confirm: I wrote the command above incorrectly, and I did indeed use formula = ~ X_1, lv.formula = ~ X_2 + ...

Unfortunately, I cannot create a reprex using public data, and this dataset isn't mine to post.

However, here's an example of what it looks like: coefplot(fit, cex.ylab = 0.5, y.label = FALSE, mfrow = c(2, 3)) (I set mfrow = c(2, 3) for display here).

The x-scales are different between the top and bottom rows.

There are 665 species in this dataset, and our species names are very long (the identifying codes are cut out of the coef plot), so I'm unable to compare the plots in detail. Your microbial dataset vignette also has lots of species and only shows one coef plot per predictor, so I was surprised to see this.

[Screenshot: coefplot output showing two coefficient plots per predictor, with differing x-axis scales between the top and bottom rows]

With regard to model selection, thanks very much for the advice. We fortunately have an a priori plan to test X_1,2,3,4, so I put them all in the model, but I also wanted to see how robust the results were to changes, which is why I started experimenting with removing one of them. I also tried the randomB argument, but gllvm crashed, so I stopped that for now.

BertvanderVeen commented 1 year ago

How odd! Neither of those (gllvm crashing, double the number of plots) should be happening, and I cannot reproduce either one. Which version of gllvm are you working with? I am happy to investigate this to clear things up.

dougwyu commented 1 year ago

1.4.1

I'd be happy to email you the dataset if you want. I just can't post it.

doug

BertvanderVeen commented 1 year ago

Have a look at my development branch here: https://github.com/BertvanderVeen/gllvm/tree/gllvmUpdate. Please install it and try it out; it should solve your problems.

dougwyu commented 1 year ago

Thanks very much! I have set the job running. Will let you know!

dougwyu commented 1 year ago

Hi Bert, I installed the update and got the error below after running an unconstrained ordination. (This had run without error, and gave the expected results, on the main-branch version 1.4.1; maybe you have an evil parenthesis or comma somewhere.)

devtools::install_github("BertvanderVeen/gllvm@gllvmUpdate")

fitnull <- gllvm(y = y, family = binomial(), num.lv = 1, studyDesign = Site, row.eff = ~ (1 | Site), control.start = list(n.init = 10))

Standard errors for parameters could not be calculated, due to singular fit. Warning message: In retape_adgrad() : internal error -3 in R_decompress1

BertvanderVeen commented 1 year ago

This looks like a corrupt package installation. Did you forget to restart R or remove the package before the new installation by any chance?
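
For reference, a clean reinstall of the development branch looks something like this (a sketch; restart the R session between removing and reinstalling):

# remove the previous build, restart R, then install the branch again
remove.packages("gllvm")
# ...restart the R session here...
devtools::install_github("BertvanderVeen/gllvm@gllvmUpdate")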

dougwyu commented 1 year ago

Ah, I thought I had, but apparently not. Running now, thanks.

dougwyu commented 1 year ago

It works! Thank you.

Is there an easy-ish way to extract the species coefficients shown by coefplot() in table form?

BertvanderVeen commented 1 year ago

Glad I could help! And thanks for reporting this.

You can extract the coefficients, and standard errors, like this:

# species-by-predictor effects (loadings multiplied by the reduced-rank slopes)
mod$params$theta %*% t(mod$params$LvXcoef)
# corresponding standard errors
gllvm:::RRse(mod)
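
To assemble those into a single table, something like the sketch below should work; it assumes gllvm:::RRse() returns the standard errors in the same species-by-predictor layout as the effects matrix, and it falls back to indices if the matrices carry no dimnames.

eff <- mod$params$theta %*% t(mod$params$LvXcoef)
se  <- gllvm:::RRse(mod)  # assumed to match eff in dimensions

sp  <- rownames(eff); if (is.null(sp))  sp  <- seq_len(nrow(eff))
prd <- colnames(eff); if (is.null(prd)) prd <- seq_len(ncol(eff))

# long-format table: one row per species-predictor combination
coef_table <- data.frame(
  species   = rep(sp, times = ncol(eff)),
  predictor = rep(prd, each = nrow(eff)),
  estimate  = c(eff),
  se        = c(se)
)
head(coef_table)
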
dougwyu commented 1 year ago

Thank you again. Results are indeed clearer with all predictors in num.lv.c, so I'm a fan of concurrent ordination!