Categorical enviornmental variables

LuziaThea commented 2 years ago

Hi @MaximilianPi and @florianhartig

Thank you so much for this amazing package!! It addresses such an important problem in the SDM field!

I have a basic question on how to work with categorical environmental variables when fitting sjSDM. I noticed that when I have categorical environmental variables, I get weights not for each environmental variable, but for each value of my environmental variable individually.

Here is a reproducible example to illustrate what I mean:

## simulate input data
com = simulate_SDM(env = 3L, species = 7L, sites = 100L)

## fit model with numerical environmental variables
model_num = sjSDM(Y = com$response, env = com$env_weights, iter = 50L) 
model_num[["names"]]
model_num[["weights"]][[1]] # weights for each environmental variable

## fit model with categorical environmental variables
env_categorial = data.frame(V1 = factor(sample(x = c("A", "B", "C"), size = 100, replace = T)),
                            V2 = factor(sample(x = c("Time1", "Time2", "Time3", "Time4"), size = 100, replace = T)),
                            V3 = factor(sample(x = c("ZZ", "XX"), size = 100, replace = T)))

model_cat = sjSDM(Y = com$response, env = env_categorial, iter = 50L) 
model_cat[["names"]]
model_cat[["weights"]][[1]] # weights for each VALUE of each environmental variable

How would I do this correctly to get weights for each environmental variable similar to numerical environmental variables when working with categorical variables? Also, how would you recommend to incorporate time series data in your model currently?

Thank you again so much for your efforts and best wishes, Luzia

MaximilianPi commented 2 years ago

Hi @LuziaThea,

Most models, and this also true for JSDMs or regression models, cannot handle categorical features (some of the tree based ML methods are an exception). The default to handle categorical variables is to create n-1 dummy variables (n= number of levels in the categorical feature) and then estimate an effect for each of the levels.

We can take a look at the dummy variables by running:

model.matrix(~., data = env_categorial)

  (Intercept) V1B V1C V2Time2 V2Time3 V2Time4 V3ZZ
1           1   0   0       0       0       1    0
2           1   0   0       0       0       1    1
3           1   1   0       0       1       0    0
4           1   0   0       0       1       0    1
5           1   0   1       0       0       0    0
6           1   0   0       0       0       0    1

We see that the dummy variables tell the model which level is present (1) or not (0) for each categorical feature. N-1 because the first levels of the categorical groups are used as reference (meaning that other dummy variables (of one feature) = 0 corresponds to the first level of the categorical variables). Also, the estimates for the categorical features are always refer to the reference level, so if you have one categorical variable with 3 levels and effect estimates of 10, 0.1, -0.2, the actual effects for the three levels are 10, 10+01, and 10-0.2. If you don’t want to compare your levels to a specific level, you can tell the model by changing the formula to:

model.matrix(~0+., data = env_categorial)
  V1A V1B V1C V2Time2 V2Time3 V2Time4 V3ZZ
1   1   0   0       0       0       1    0
2   1   0   0       0       0       1    1
3   0   1   0       0       1       0    0
4   1   0   0       0       1       0    1
5   0   0   1       0       0       0    0
6   1   0   0       0       0       0    1

(And we see that we have now n dummy variables for each categorical variable)

There are different ways to encode categorical features (the encodings are called contrats, see also https://cran.r-project.org/web/packages/faux/vignettes/contrasts.html) and 0/1 (named treatment) is the default encoding.

Now to get back to your question: You are doing everything right, there is just no easy way to treat categorical features like continuous features, but I assume that you want to get the average effect off the categorical variables, right? We can only do this post-hoc, i.e. we can estimate the mean effect of a categorical feature based on the level-effects and their standard errors from the SDJM models by using the metafor package. For example, for the V1 variable of the first species:


library(metafor)
model_cat = sjSDM(Y = com$response, env = linear(env_categorial, ~0+.), iter = 50L, se = TRUE) 
sum_output = summary(model_cat)
effs_V1 = sum_output$coefmat[1:3,1]
se_V1 = sum_output$coefmat[1:3,2]
mean_v1 = metafor::rma(effs_V1, sei = se_V1)
summary(mean_v1)

Random-Effects Model (k = 3; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
  0.0346   -0.0692    3.9308    1.3171   15.9308   

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.1424)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity:
Q(df = 2) = 0.1036, p-val = 0.9495

Model Results:

estimate      se     zval    pval    ci.lb   ci.ub    
 -0.0416  0.2164  -0.1924  0.8474  -0.4658  0.3825

LuziaThea commented 2 years ago

Hi @MaximilianPi

Thank you so so much for your fast and super detailed response! Thanks a lot that you took the time to do a complete statistic recap and explained how to do the post-hoc analysis with metafor, that was exactly was I was looking for!!

TheoreticalEcology / s-jSDM

Categorical enviornmental variables #103