TheoreticalEcology / s-jSDM

Scalable joint species distribution modeling
https://theoreticalecology.github.io/s-jSDM/
GNU General Public License v3.0

Time as "spatial" variable #108

Zurapiti opened this issue 1 year ago (Open)

Zurapiti commented 1 year ago

Hello developers,

Thank you very much for your work; I like your package a lot. I wonder whether it would be safe/relevant/reasonable to introduce a "time" variable in the spatial terms, or whether that is not within the scope of the package at all.

I have a time series of species abundance data, which is spatially and temporally correlated. This data set comes from a monitoring program, so we have several sites sampled across many time points.
I have run some tests with simulated data and it has worked so far, but I am unsure: 1) whether this is conceptually correct for this package, and 2) if so, whether my implementation is correct.

The attached file contains a minimal working example of what I am trying to achieve, but I am just not sure it is a sensible approach.

33_pilot_sJSDM_spatiotemporalDNN.txt

Zurapiti commented 1 year ago

By the way, I tried it with my data and the model ran, but the ANOVA resulted in the following:

Analysis of Deviance Table

Terms added sequentially:

        Deviance  Residual deviance  R2 Nagelkerke  R2 McFadden
Biotic       NaN                NaN            NaN          NaN
Abiotic      NaN                NaN            NaN          NaN

thus, I thought maybe I am doing something wrong...

MaximilianPi commented 1 year ago

Hi @Zurapiti,

In general this is a valid approach: you correct for spatial/temporal trends and autocorrelation by modeling all your spatial and temporal variables with a DNN, which is perfectly fine and a good idea. Now for your problems:

  1. NAs in your simulation: these are caused by an underflow issue in the model. Set dtype = "float64" in sjSDM(...) and the anova should work (a minimal sketch of where the argument goes follows the results below); here are the results of your simulation:
    
    > an_spatDNN
    Analysis of Deviance Table

Terms added sequentially:

        Deviance  Residual deviance  R2 Nagelkerke  R2 McFadden
Biotic  1629.047           1176.614          1.000       0.3406
Abiotic  205.832            970.782          1.000       0.3836
Spatial   13.967            956.815          1.000       0.3866
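
For reference, a minimal sketch of where the dtype argument goes (variable names such as Y, X and XYcoords follow your attached script and are placeholders; everything else stays as in your call):

library(sjSDM)
m <- sjSDM(Y,
           env     = linear(X),                              # environmental predictors
           spatial = DNN(XYcoords, ~X1:X2, lambda = 0.4),    # your original spatio-temporal DNN
           family  = poisson(),
           dtype   = "float64")                              # float64 avoids the underflow behind the NaNs
an_spatDNN <- anova(m)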


2. Question: why is there no spatiotemporal effect in the model?
There are two problems here. First, your regularization is too strong: lambda = 0.4 will push all weights to zero. Normally (with enough observations) a small lambda (e.g. 0.001) is sufficient, especially if you use only small networks (which you do). Second, you are entering only the X1:X2 interaction into the model, probably because you simulated an interaction (CAR structure). The problem is that in your simulation you add the spatio-temporally structured residuals on the response level, i.e. after the inverse link function, whereas in the model X1:X2 goes through the inverse link function (exp(X1*X2) != exp(X1)*exp(X2)), which means you have effectively simulated X1 and X2 additively on the linear scale.
Long story short, if you change `DNN(XYcoords, ~X1:X2, lambda = 0.4)` to `DNN(XYcoords, ~X1 + X2, lambda = 0.001)`, the Anova results change to:

> an_spatDNN
Analysis of Deviance Table

Terms added sequentially:

        Deviance  Residual deviance  R2 Nagelkerke  R2 McFadden
Biotic  1595.757           1209.595          1.000       0.3336
Abiotic  266.770            942.824          1.000       0.3894
Spatial   54.777            888.048          1.000       0.4009



And the model reports a spatio-temporal signal (which can also be seen in the internal plots).

A few recommendations:
- DNN structure: set the intercept to 0, because the intercept is already modelled in the env object.
- DNN structure: you don't need to use interactions there; just pass all your spatial/temporal predictors as additive effects to the model, and the DNN can automatically learn interactions between the variables if needed (see the sketch below).
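
Putting both recommendations and the lambda fix together, the spatial part would look something like this (a sketch: the hidden layer sizes are just an example, and I am assuming the intercept is dropped via the ~0 term in the formula):

spatial_dnn <- DNN(XYcoords, ~0 + X1 + X2, hidden = c(10L, 10L), lambda = 0.001)  # additive terms, no intercept, weak regularization
m <- sjSDM(Y,
           env     = linear(X),        # intercept and environment are modelled here
           spatial = spatial_dnn,
           family  = poisson(),
           dtype   = "float64")
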
Zurapiti commented 1 year ago

hey @MaximilianPi

thanks a lot for your reply. Your suggestions worked like a charm for up to ~25 species simulated with the simulate_SDM function. When I tried plugging in my own data, though, :( not quite the same result. The model runs, but the anova cannot be computed because the deviance of the biotic component becomes NaN. I don't think changing the dtype is possible anymore (is it?)

I simulated a small community that resembles the type of data I have much more closely, and it turned out that this community has the same problem (NaN anovas; code attached in case you want to give it a look).

So, my question is: how well can I use a model for which the anova you propose cannot be computed? Is being able to compute the anova important for working with the models on, say, predicting species abundances, or on other questions beyond partitioning the variance? Is there any other way to measure the contribution of the different parts of the model?

thank you for your attention and very kind regards, Jazz

34_pilot_sJSDM_mimmicDataStructure.txt

MaximilianPi commented 1 year ago

Hi Jazz,

Normally, the number of species does not have a large effect on numerical stability. However, it may happen that the distribution of a species is so skewed (dispersion problems) that it cannot be modeled by a Poisson distribution (this is a common problem with the Poisson distribution because of its assumption that mean == variance). Species 1 is causing the error in your simulation (if you remove species 1, the Anova will not produce NAs). Let's take a step back and try to fit species 1 with a "simple" GLM using glmmTMB and visualize the residuals:

library(glmmTMB)
library(DHARMa)
m = glmmTMB(Y ~ E1 + E2 + X1 + X2 + X1:X2,
            data = data.frame(Y  = Y[, 1],                 # counts of species 1
                              E1 = X[, 1],                 # environmental predictors
                              E2 = X[, 2],
                              X1 = scale(XYcoords)[, 1],   # scaled spatio-temporal coordinates
                              X2 = scale(XYcoords)[, 2]),
            family = poisson())
plot(simulateResiduals(m))
[DHARMa residual plot for m]

Okay, wow, the residuals look horrible; we have an overdispersion problem (see testDispersion(m)). Let's switch to a negative binomial distribution:

m2 = glmmTMB(Y ~ E1 + E2 + X1 + X2 + X1:X2,
             data = data.frame(Y  = Y[, 1],
                               E1 = X[, 1],
                               E2 = X[, 2],
                               X1 = scale(XYcoords)[, 1],
                               X2 = scale(XYcoords)[, 2]),
             family = nbinom2())
plot(simulateResiduals(m2))
[DHARMa residual plot for m2]

Residuals look good now!

So, in summary, it is difficult to model some of the species you simulated (species 1) with a Poisson distribution. For your empirical data, I recommend trying a Poisson distribution first, and if that doesn't work, you should transform your data to presence/absence (PA), since a binomial distribution has no dispersion problem.
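
If you go the PA route, the conversion is just thresholding the abundance matrix; a minimal sketch (probit binomial here, with the same placeholder names as above):

Y_pa <- ifelse(Y > 0, 1, 0)   # abundances -> presence/absence
m_pa <- sjSDM(Y_pa,
              env     = linear(X),
              spatial = DNN(XYcoords, ~0 + X1 + X2, lambda = 0.001),
              family  = binomial("probit"))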

I will try to implement a negative binomial distribution for sjSDM, but this may take a few weeks (I will also try to make the Anova more stable).

Zurapiti commented 1 year ago

Hey @MaximilianPi, thank you very much for your time and replies. After your last comment I double-checked all the species and realized that most of them suffer from the same problem, so removing a few species will not be a good solution :( I guess I am not the only person dealing with overdispersed data, so a negative binomial would be a really nice addition to your package. Kind regards, Jazz

metagenAu commented 1 year ago

Hi @MaximilianPi ,

A negative binomial option would be really awesome!

Chris

MaximilianPi commented 1 year ago

@Zurapiti @metagenAu a new version with a negative binomial family is on its way to CRAN

metagenAu commented 1 year ago

Awesome! I'm keen to test this out.

Chris

Zurapiti commented 1 year ago

woooo, thank you <3 Jazz

MaximilianPi commented 1 year ago

@metagenAu , @Zurapiti

Let me know what you think; it is still experimental and I am not sure how stable it is. If you run into NAs during training, restart with a lower learning rate and more iterations.
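
Roughly, the call would look like this (the exact family name and the tuning-argument names may differ in the installed version - please check ?sjSDM):

m_nb <- sjSDM(Y,
              env     = linear(X),
              spatial = DNN(XYcoords, ~0 + X1 + X2, lambda = 0.001),
              family  = "nbinom",        # assumed name of the new negative binomial family
              learning_rate = 0.005,     # lower learning rate ...
              iter    = 200L,            # ... and more iterations if you see NAs during training
              dtype   = "float64")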

metagenAu commented 1 year ago

@MaximilianPi the functionality works well on the data I've tested. I will tune the model and then check the residuals to see whether the fit is appropriate, etc. I'll run it on a couple of datasets I have and let you know if there are any issues. I did have one question: in the GLLVM package there is an option "row.eff" to account for the total number of counts in the community data - a sort of model offset. This can be useful when using NGS sequencing data, which gives uneven counts between samples that can confound results. Would it be best to model this in the spatial effects, or in the environmental effects?

MaximilianPi commented 1 year ago

@metagenAu I would use it in the object where you set the intercept - so in the env object. Technically it doesn't matter; it will 'only' affect your anova (if you do one). In general, the idea behind separating env and space is that we wanted the option to flexibly condition the model on space, e.g. by using a DNN for the spatial model, so I would put only spatial predictors in the spatial object, and all other predictors (including an offset) should go into env.
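
In practice that means adding something like the log of the per-sample total counts as an extra column of the env data; a sketch (logDepth is a made-up name, X and XYcoords stand in for your own predictors, and note that the term enters as a covariate with an estimated coefficient rather than as a fixed offset):

env_data <- data.frame(X, logDepth = log(rowSums(Y)))   # per-sample library size
m <- sjSDM(Y,
           env     = linear(env_data),                   # depth term goes into env, next to the intercept
           spatial = DNN(XYcoords, ~0 + X1 + X2, lambda = 0.001),
           family  = poisson())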

metagenAu commented 1 year ago

OK great. It could be good functionality, down the track, to have a technical/random-effects component. A good use case for s-jSDM is NGS sequencing data, where sequencing depth, the sequencing run, and the lab generating the sequences can introduce error into the data.

MaximilianPi commented 1 year ago

Yes, I know - they could be estimated via MLE, but a reliable implementation (one that works hand in hand with stochastic gradient descent) requires a lot of time and testing.