TheoreticalEcology / s-jSDM

Scalable joint species distribution modeling
https://cran.r-project.org/web/packages/sjSDM/index.html
GNU General Public License v3.0

Scale of predicted results and questions about tuning #117

Open YJ781 opened 1 year ago

YJ781 commented 1 year ago

Hi developers,

Thank you so much for your work. I am new to JSDMs and am currently learning your package to analyze a microbial dataset containing ASV count data. I found that the predictions are not on the same scale as the training data; the summaries of the training and predicted ASV data are shown below. I therefore wonder whether the output predictions are already scaled, and whether I need to scale the training data before modeling. Can you help me check?

summary of training data:

    Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
     0.0    292.5   505.5   871.0   1140.5  8401.0

summary of predicted data:

        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    0.000159  0.005185  0.042844  0.329975  0.187011  7.820024

Here's the code I used to build the model and make predictions (train_X was scaled beforehand):

tune = sjSDM_cv(env = train_X, Y = train_Y,
                learning_rate = 0.01, iter = 100L,
                CV = 10, tune_steps = 40,
                lambda_cov = seq(0, 0.1, 0.001), lambda_coef = seq(0, 0.1, 0.001),
                alpha_cov = seq(0, 1, 0.05), alpha_coef = seq(0, 1, 0.05),
                sampling = 100L,
                biotic = bioticStruct(df = dim(train_Y)[2]),
                blocks = 6L, step_size = 4L,
                family = poisson("log"), n_cores = 6L)

best1 = tune[order(tune$logLik),][1,]

model1 = sjSDM(Y = train_Y,
               env = linear(train_X, lambda = best1[["lambda_coef"]], alpha = best1[["alpha_coef"]]),
               biotic = bioticStruct(lambda = best1[["lambda_cov"]], alpha = best1[["alpha_cov"]],
                                     df = dim(train_Y)[2]),
               learning_rate = 0.01, iter = 150L,
               family = poisson("log"), sampling = 1000L)

Additionally, I have two more questions related to the code above.

  1. I found that the actual number of CPUs used for tuning equals the number of blocks, not n_cores. Is this correct? I don't quite understand what the blocks mean.
  2. I received a warning after sjSDM_cv: "Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 3)". What does this warning mean, and do I need to adjust something?
MaximilianPi commented 1 year ago

Hi @YJ781,

No, the predictions are not scaled; by default they are on the scale of your response, so the predictions are fine. I think the problem is that you have underdispersion in your data (based on the summary of your training responses, the min/1st quartile doesn't fit the other quartiles), which makes it very hard to fit a Poisson GLM/JSDM properly to your data.
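As a quick sanity check (a base-R sketch with a simulated stand-in for your count matrix, not part of sjSDM), you can compare each species' variance to its mean; under a Poisson model the ratio should be close to 1:

```r
# Per-species dispersion check. Under a Poisson model var ≈ mean for each
# species, so ratios well below 1 hint at underdispersion and ratios well
# above 1 at overdispersion. `train_Y` here is simulated stand-in data.
set.seed(1)
train_Y = matrix(rpois(44 * 10, lambda = 500), nrow = 44)  # stand-in counts

dispersion = apply(train_Y, 2, var) / colMeans(train_Y)
round(summary(dispersion), 2)
```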

Over/underdispersion is often a challenge with count data, so I suggest that you either transform your abundance data into presence-absence data (e.g., for each species, set the value to 1 if count_i > mean(counts) and 0 otherwise) or try the new negative binomial distribution (family = "nbinom") from the latest CRAN version (but nbinom can be difficult to fit, and you may need to try different learning rates).
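A minimal sketch of that presence-absence transformation (the matrix here is a simulated stand-in for your ASV counts):

```r
# Threshold each species (column) at its own mean count, as suggested above:
# 1 if the count exceeds the species mean, 0 otherwise.
set.seed(1)
train_Y = matrix(rpois(44 * 10, lambda = 500), nrow = 44)  # stand-in counts

train_Y_pa = apply(train_Y, 2, function(y) as.integer(y > mean(y)))
table(train_Y_pa)  # counts of 0s and 1s across the matrix
```

The resulting 0/1 matrix can then be modeled with a binomial family instead of Poisson.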

About tuning:

  1. The number of blocks is the number of tuning steps that are run before the cluster is restarted (in the past I had memory-leak problems when running everything on one cluster).
  2. I have never seen this warning, can you provide a reproducible example?

Best, Max

YJ781 commented 1 year ago

Hi Max,

Thank you so much for your suggestions. I realized that I didn't use all ASVs, but only 10 or 100 ASVs, to build the model, as the response matrix would become too large. This might lead to underdispersion. I'll try implementing your suggestions and also fitting the whole community to a Poisson distribution.

Regarding the warning, I've attached the response matrix and the scaled env matrix where I encountered the warning message. Thank you!

Best, Yj

warning_response_training.csv
warning_env_training.csv

MaximilianPi commented 1 year ago

Hi @YJ781,

Over/underdispersion

ASVs == species, right? Then the under-/overdispersion isn't connected to the number of ASVs, because it is ASV/species specific. Let's take a look at the first species with a traditional GLM (sjSDM essentially fits a GLM for each species; the GLMs are then connected by species-species associations):

  1. Poisson model:

    library(glmmTMB)  # glmmTMB()
    library(DHARMa)   # simulateResiduals()

    m = glmmTMB(Y[,1] ~ E, family = poisson())
    plot(simulateResiduals(m))
    (DHARMa residual plot of the Poisson model)

  2. nbinom model:

    m_nb = glmmTMB(Y[,1] ~ E, family = nbinom1())
    plot(simulateResiduals(m_nb))
    (DHARMa residual plot of the nbinom model)

Predictions

Even with the negative binomial distribution, the models (sjSDM and individual GLMs) are unable to make predictions on the range of the response (e.g., the first species has up to 8000 counts, but the models predict a maximum of 9/10). This is probably explained by regression dilution (there is additional error on X, but our models only account for error on Y). How severe this is depends on your objective: if you want to use the model for predictions, it is a problem (and you could probably rescale your response); for inference it is not a big problem as long as you are only interested in comparing the effects to each other (in absolute terms they are biased toward zero).
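If rescaling is needed for a prediction task, one simple option (a hypothetical sketch with simulated data, not something sjSDM provides) is a calibration regression of observed counts on the model's predictions on the log scale:

```r
# Calibration sketch: `obs` are stand-in observed counts, `pred` are stand-in
# compressed model predictions. A log1p-scale linear calibration maps the
# predictions back toward the range of the observations.
set.seed(1)
obs  = rpois(100, lambda = 500)
pred = obs / 1000 * exp(rnorm(100, sd = 0.1))  # deliberately compressed

calib    = lm(log1p(obs) ~ log1p(pred))
rescaled = expm1(fitted(calib))
range(rescaled)  # roughly back on the scale of `obs`
```

Note that such a post-hoc rescaling only stretches the predictions; it does not remove the underlying bias from error in X.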

Warning/Tuning

I agree that we need regularization because you have 27 predictors for 44 observations, plus the association matrix. But with such small datasets, more splits in the CV are better: with, for example, only a 5-fold CV you end up with small training folds, so a 20-fold CV or even LOOCV would be best.

I couldn't reproduce your warnings, but I slightly adjusted your code. The following runs successfully on my machine:

tune = sjSDM::sjSDM_cv(as.matrix(Y), E,
                       biotic = bioticStruct(df = ncol(Y), reg_on_Cov = FALSE),
                       CV = 40L, tune_steps = 120L, n_gpu = 4L,
                       iter = 150L,
                       n_cores = 12L, blocks = 24L, sampling = 1000L,
                       family = "nbinom")
tune$summary
best = tune$summary[order(tune$summary$ll_test),][1,]
# i: indices of the held-out rows (defined elsewhere)
m = sjSDM(as.matrix(Y)[-i,],
          env = linear(E[-i,], lambda = best$lambda_coef, alpha = best$alpha_coef),
          biotic = bioticStruct(df = ncol(Y), reg_on_Cov = FALSE,
                                lambda = best$lambda_cov, alpha = best$alpha_cov),
          family = "nbinom", iter = 150L, device = sample.int(4, 1) - 1L,
          sampling = 1000L)
summary(m)