JenniNiku / gllvm

Generalized Linear Latent Variable Models
https://jenniniku.github.io/gllvm/

Very small information criterion values (-3.503954e+195) #39

Closed KwnLi closed 2 years ago

KwnLi commented 3 years ago

Hello, I am finding an unexpected outcome when I fit my dataset to environmental variables without latent variables. All the information criteria (AIC, AICc, and BIC) are -3.503954e+195. When I add latent variables (in a process based on the vignette) I get more expected values (AICc values shown below):

             0              1              2              3              4 
-3.503954e+195   7.486189e+03   8.720227e+03   9.069381e+03   9.235085e+03 

The first value of the vector above (0) corresponds to the model without a latent variable.

Any idea what might be behind this issue? I am using a negative binomial family on a matrix with 23 species and 144 sites, fit to 3 environmental variables. I am also including an offset for the number of traps set at a site (n): offset = log(n).
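
For reference, here is a rough sketch of the kind of call I am running (object names are placeholders for my actual data: `abund` is the 144 x 23 count matrix, `env` the data frame with the 3 environmental variables, and `n` the vector of traps per site):

```r
library(gllvm)

# abund: 144 x 23 count matrix, env: data frame with 3 environmental
# covariates, n: number of traps per site (placeholder names)
fit0 <- gllvm(y = abund, X = env,
              family = "negative.binomial",
              num.lv = 0,           # no latent variables
              offset = log(n))      # trap-effort offset

logLik(fit0)
AIC(fit0)   # this is where I see the huge negative value
```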

Thank you for your work on this package and for any ideas you might have!

BertvanderVeen commented 3 years ago

Thanks for your question! That is odd; can you have a look at the output of logLik(model)?

KwnLi commented 3 years ago

Thank you for looking at this! The output is: 'log Lik.' 1.751977e+195 (df=138)

BertvanderVeen commented 3 years ago

OK, so it seems that there is a convergence issue with your model. Does this specific model have (considerably) more parameters than the others?

Try troubleshooting that by, e.g., changing the starting values or the seed (see ?gllvm), checking the gradient values, or adding and removing covariates to see if any specific covariate is causing issues.
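
Something along these lines, as a rough sketch (using the placeholder objects from your post above; the exact argument names are documented in ?gllvm and may differ a bit between package versions):

```r
library(gllvm)

# Refit with different (random) starting values, a fixed seed, and several
# initial fits, keeping the best one
fit0b <- gllvm(y = abund, X = env,
               family = "negative.binomial",
               num.lv = 0, offset = log(n),
               starting.val = "random",  # "zero" is another option
               seed = 123,
               n.init = 5)
logLik(fit0b)   # should now be a sensible, finite value

# Drop covariates one at a time to see whether a specific one causes trouble
fit_drop1 <- gllvm(y = abund, X = env[, -1, drop = FALSE],
                   family = "negative.binomial",
                   num.lv = 0, offset = log(n))
```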

KwnLi commented 3 years ago

Thanks for the suggestion. The model that had the very low IC had 0 latent variables, compared to the others, which had 1-4 (the number of latent variables is shown in the first row of the R output in my first post). There were 3 covariates. I know that two of the covariates are more highly correlated with each other. Along the lines of your suggestion, I removed one of them and ran the same comparison, i.e. between models with 0-4 latent variables. Here's the output:

        0         1         2         3         4
11615.926 18068.258  7383.033  8713.173  9082.542

So maybe it's related to the correlated covariates, i.e. a multicollinearity issue? In my original post, you'll note that adding at least one latent variable allows the model with three covariates to converge. Is this a valid approach, or might the non-convergence be evidence that the two correlated variables shouldn't be in the model together?

Thank you for your help!

BertvanderVeen commented 3 years ago

Yes, collinearity can result in convergence issues, which should also show in the gradient. If the two covariates are highly collinear, you could consider dropping one of the two (or fitting a model with either and seeing which model performs best, measured with information criteria).

Additionally, it is always good practice to check model convergence with the (continuous) covariates mean-centered and scaled to unit variance, as this generally improves convergence.
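
For example, a rough sketch with the same placeholder objects as above (and placeholder covariate names), assuming all three covariates are continuous:

```r
# Mean-centre and scale the continuous covariates, then refit
env_sc <- as.data.frame(scale(env))

fit_sc <- gllvm(y = abund, X = env_sc,
                family = "negative.binomial",
                num.lv = 0, offset = log(n))

# Fit a model with either of the two collinear covariates (cov1, cov2) plus
# the third (cov3), and compare them by information criteria
fit_a <- gllvm(y = abund, X = env_sc[, c("cov1", "cov3")],
               family = "negative.binomial", num.lv = 0, offset = log(n))
fit_b <- gllvm(y = abund, X = env_sc[, c("cov2", "cov3")],
               family = "negative.binomial", num.lv = 0, offset = log(n))
AIC(fit_a, fit_b)
```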

Good luck! Let me know if you have any other questions.

KwnLi commented 3 years ago

Thanks again for your help! Scaling and centering helped with convergence.

A follow-up question: is there a way to assess the effect of multicollinearity in these models?

I had both of the (possibly) collinear variables included because a reviewer wanted to know the effect of one of them (altitude), though the other one (distance to forest) is my main focus. The effect of altitude is not of primary interest in the research question, but because of the topography of the field site, locations that are far from the forest occur only at higher altitudes. Closer to the forest, there is a range of altitudes in the data.

Thanks!

BertvanderVeen commented 3 years ago

Glad to hear that it helped. In general, no, I don't know of a way to assess the effects of multicollinearity. Models with random effects tend to be fussy about convergence, so it is good practice to double-check that everything is in order (using the suggestions I made above: compare model fits with scaled and unscaled variables, check the gradient, etc.).

Overall, if two environmental variables are collinear, that probably tells you something vital about your study system, and the best practice is to consider how important it is for both of them to be included in the model. One way or another, if the model converges (which is generally the case if it is not overparameterized and the continuous covariates are scaled and centered), collinearity is not a (massive) issue. Just be careful when interpreting the parameter effects, as they are likely confounded (as in the case you describe).

In the case of altitude: altitude has no direct ecological effect on species distributions. Altitude tends to represent some other ecological gradient(s), such as temperature, slope or ruggedness of the landscape, potential for snow patches, or differences in soil conditions (e.g. moisture and/or nutrients). So, my advice in the case of collinearity (for whatever variable) is to try to figure out the ecological drivers that your variables represent, and to reason from that about the nature of their collinearity. Hopefully, this gives you sufficient support to exclude one or the other, or to include both.