Thanks for your question, Doug.
I am afraid I cannot reproduce issue 1). For example,
library(gllvm)
data(spider)
mod <- gllvm(spider$abund, X = spider$x, formula = ~ herb.layer,
             lv.formula = ~ soil.dry + fallen.leaves + bare.sand + moss,
             family = "poisson", num.lv.c = 2)
coefplot(mod)
does not produce the behavior you describe. It produces one coefplot per predictor, as expected. Can you provide a reproducible example? Note that your example model above includes the lv.formula argument twice, which would throw an error in R ('formal argument "lv.formula" matched by multiple actual arguments').
With respect to 2), I personally would be a bit careful about approaching model selection on the basis of the appearance/disappearance of significant effects, and of what might or might not be "true drivers". Because all model terms are estimated simultaneously, a change in model structure can change parameter estimates and their standard errors in unpredictable ways, and predicting why/how they change is very difficult (especially in the presence of collinearity between predictors).
The difference between model 1) above and model 2) is m - 2 parameters, with m being the number of species, so model 1) is quite a bit more flexible than model 2), and that flexibility is in the "X_1" term.
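(To unpack that count with a worked example: with "X_1" in formula it gets its own slope for each of the m species, whereas in lv.formula it enters the reduced-rank term with one coefficient per latent variable, here num.lv.c = 2. So with, say, m = 100 species, that is 100 - 2 = 98 extra parameters.)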
In an attempt to explain the above: if "X_1" is important for model fit, then when it is omitted the model might attempt to compensate by modifying other terms. Here, that would have to be the predictor terms on the latent variable scale. That does not necessarily mean that "X_1" or any of the other predictors are important or not; it is "just" the model trying to find a set of parameter estimates that provides as decent a fit to the data as possible (which, for the record, might still be a terrible fit).
Selecting a model for inference on the basis of significance will likely lead to p-hacking, so I would advise finding a different method for determining the best model fit. Significance should not come into play until you have decided which model is "best" or "appropriate", which incorporates many things, including your expectations/feel for the system, and also e.g. checking residual diagnostics for assumption violations.
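For the residual diagnostics, a minimal sketch (assuming the plot method for fitted gllvm objects; mod is a fitted model as above):
plot(mod, which = 1:4)  # Dunn-Smyth residual diagnostic plots for a gllvm fit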
I am not sure if I would consider including predictors on the latent variable scale because "they might not be the true drivers". After all, if they aren't the true drivers they should probably not be in the model, and otherwise the modeling procedure should reject/confirm whether they are the effects that drive the process the data was generated from. This might require some deeper thought than I am capable of today, but this might be a case where you want to treat the latent variable predictor effects as random effects with the randomB argument.
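A minimal sketch of that (using your y, X and Site objects, and assuming your gllvm version supports randomB; "LV" is one of its options):
fitr <- gllvm(y, X, num.lv.c = 2,
              lv.formula = ~ X_1 + X_2 + X_3 + X_4,
              studyDesign = Site, row.eff = ~ (1 | Site),
              randomB = "LV")  # reduced-rank predictor slopes treated as random effects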
Thanks for that, Bert.
Just to confirm: I wrote the command above incorrectly, and I did indeed use formula = ~ X_1, lv.formula = ~ X_2 + ... .
Unfortunately, I cannot create a reprex using public data, and this dataset isn't mine to post.
However, here's an example of what it looks like: coefplot(fit, cex.ylab = 0.5, y.label = FALSE, mfrow = c(2, 3))
(I set mfrow = c(2, 3) to display here.)
The x-scales are different between the top and bottom rows.
There are 665 species in this dataset, and our species names are very long (with the identifying codes cut out of the coef plot), so I'm unable to really compare the plots in detail. Your microbial dataset vignette also has lots of species and shows only one coef plot per predictor, so I was surprised to see this.
With regards to model selection, thanks very much for the advice. We fortunately have an a priori plan to test X_1,2,3,4, so I put them all in the model, but I also wanted to see how robust the results were to changes, which is why I started experimenting with removing one of them. I also tried the randomB argument, but gllvm crashed, so I stopped that for now.
How odd! Neither of those (gllvm crashing, and double the number of plots) should be happening, and I cannot reproduce either. Which version of gllvm are you working with? I am happy to investigate this to clear things up.
1.4.1
I'd be happy to email you the dataset if you want. I just can't post it.
Doug
Have a look at my development branch here: https://github.com/BertvanderVeen/gllvm/tree/gllvmUpdate. Please install it and try it out; it should solve your problems.
Thanks very much! I have set the job running. Will let you know!
Hi Bert, I installed the update and got this error after running an unconstrained ordination. (This had run without error, and gave expected results, on the main-branch version 1.4.1; maybe you have an evil parenthesis or comma somewhere.)
devtools::install_github("BertvanderVeen/gllvm@gllvmUpdate")
fitnull <- gllvm(y = y, family = binomial(), num.lv = 1, studyDesign = Site,
                 row.eff = ~ (1 | Site), control.start = list(n.init = 10))
Standard errors for parameters could not be calculated, due to singular fit.
Warning message: In retape_adgrad() : internal error -3 in R_decompress1
This looks like a corrupt package installation. Did you forget to restart R or remove the package before the new installation by any chance?
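For reference, a clean-reinstall sketch (restart R between the two steps):
remove.packages("gllvm")  # drop the corrupt installation first
# restart R, then:
devtools::install_github("BertvanderVeen/gllvm@gllvmUpdate")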
Ah, I thought I had, but apparently not. Running now. Thanks.
It works! Thank you.
Is there an easy-ish way to extract the species coefficients from coefplot() in table form?
Glad I could help! And thanks for reporting this.
You can extract the coefficients, and their standard errors, like this:
mod$params$theta %*% t(mod$params$LvXcoef)  # species-by-predictor coefficient matrix
gllvm:::RRse(mod)  # the corresponding standard errors
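If you want them as one table, a minimal sketch (assuming gllvm:::RRse() returns the standard errors in the same species-by-predictor layout as the coefficients, and that the coefficient matrix carries species and predictor dimnames):
coefs <- mod$params$theta %*% t(mod$params$LvXcoef)  # species x predictor coefficients
ses   <- gllvm:::RRse(mod)                           # assumed to match coefs in layout
tab <- data.frame(species   = rep(rownames(coefs), times = ncol(coefs)),
                  predictor = rep(colnames(coefs), each = nrow(coefs)),
                  coef      = as.vector(coefs),
                  se        = as.vector(ses))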
Thank you again. Results are indeed clearer with all predictors in num.lv.c, so I'm a fan of concurrent ordination!
I've been doing a little experimentation with concurrent ordination. I have, say, 4 predictors, with X_1 being the one that I'm interested in scientifically, while the others are nuisance variables. An example would be X_1 being a management treatment, and X_2,3,4 being local environmental measurements like nearby landcover, and so on.

mod1:
gllvm(y, X, num.lv.c = 2, formula = ~ X_1, lv.formula = ~ X_2 + X_3 + X_4, studyDesign = Site, row.eff = ~ (1 | Site))

mod2:
gllvm(y, X, num.lv.c = 2, lv.formula = ~ X_1 + X_2 + X_3 + X_4, studyDesign = Site, row.eff = ~ (1 | Site))

With mod1, I get coefficients for X_1, and I force X_2,3,4 + latent variables to model the residuals. With mod2, I force X_1,2,3,4 + latent variables to model the whole dataset. So I understand why I would put all X in lv.formula, but when I ran mod2, I got some slightly strange results:

1) When I run coefplot(), I get two coef plots per X (i.e. eight plots total for X_1,2,3,4), and the two plots for each predictor are different. Is this expected? How do I read these?

2) I experimented and ran a slightly simpler mod3 (because I thought X_3 might be measured badly):

fit <- gllvm(y, X, num.lv.c = 2, lv.formula = ~ X_1 + X_2 + X_4, studyDesign = Site, row.eff = ~ (1 | Site))

In this case, X_1,2,4 all became non-significant (via fit[["Coef.tableLV"]]), whereas in mod2, X_1,2,3,4 had all been significant. I suspect there's a collinearity effect that X_3 somehow prevented, but I don't have a good idea.

Thus, my interpretation is that the "risk" of putting X_1 in formula is that I am forcing the model to attribute significant effects to X_1 when the effects might be caused in part by other environmental covariates (measured or unmeasured). On the other hand, sometimes I have good reason to believe that X_2,3,4 are not the true drivers, so I want to attribute significant effects to X_1 if any exist.

Is this basically right? Wrong? Thanks in advance.