Closed LuziaThea closed 2 years ago
Hi @LuziaThea,
Most models, and this also true for JSDMs or regression models, cannot handle categorical features (some of the tree based ML methods are an exception). The default to handle categorical variables is to create n-1 dummy variables (n= number of levels in the categorical feature) and then estimate an effect for each of the levels.
We can take a look at the dummy variables by running:
model.matrix(~., data = env_categorial)
(Intercept) V1B V1C V2Time2 V2Time3 V2Time4 V3ZZ
1 1 0 0 0 0 1 0
2 1 0 0 0 0 1 1
3 1 1 0 0 1 0 0
4 1 0 0 0 1 0 1
5 1 0 1 0 0 0 0
6 1 0 0 0 0 0 1
We see that the dummy variables tell the model which level is present (1) or not (0) for each categorical feature. N-1 because the first levels of the categorical groups are used as reference (meaning that other dummy variables (of one feature) = 0 corresponds to the first level of the categorical variables). Also, the estimates for the categorical features are always refer to the reference level, so if you have one categorical variable with 3 levels and effect estimates of 10, 0.1, -0.2, the actual effects for the three levels are 10, 10+01, and 10-0.2. If you don’t want to compare your levels to a specific level, you can tell the model by changing the formula to:
model.matrix(~0+., data = env_categorial)
V1A V1B V1C V2Time2 V2Time3 V2Time4 V3ZZ
1 1 0 0 0 0 1 0
2 1 0 0 0 0 1 1
3 0 1 0 0 1 0 0
4 1 0 0 0 1 0 1
5 0 0 1 0 0 0 0
6 1 0 0 0 0 0 1
(And we see that we have now n dummy variables for each categorical variable)
There are different ways to encode categorical features (the encodings are called contrats, see also https://cran.r-project.org/web/packages/faux/vignettes/contrasts.html) and 0/1 (named treatment) is the default encoding.
Now to get back to your question: You are doing everything right, there is just no easy way to treat categorical features like continuous features, but I assume that you want to get the average effect off the categorical variables, right? We can only do this post-hoc, i.e. we can estimate the mean effect of a categorical feature based on the level-effects and their standard errors from the SDJM models by using the metafor package. For example, for the V1 variable of the first species:
library(metafor)
model_cat = sjSDM(Y = com$response, env = linear(env_categorial, ~0+.), iter = 50L, se = TRUE)
sum_output = summary(model_cat)
effs_V1 = sum_output$coefmat[1:3,1]
se_V1 = sum_output$coefmat[1:3,2]
mean_v1 = metafor::rma(effs_V1, sei = se_V1)
summary(mean_v1)
Random-Effects Model (k = 3; tau^2 estimator: REML)
logLik deviance AIC BIC AICc
0.0346 -0.0692 3.9308 1.3171 15.9308
tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.1424)
tau (square root of estimated tau^2 value): 0
I^2 (total heterogeneity / total variability): 0.00%
H^2 (total variability / sampling variability): 1.00
Test for Heterogeneity:
Q(df = 2) = 0.1036, p-val = 0.9495
Model Results:
estimate se zval pval ci.lb ci.ub
-0.0416 0.2164 -0.1924 0.8474 -0.4658 0.3825
Hi @MaximilianPi
Thank you so so much for your fast and super detailed response! Thanks a lot that you took the time to do a complete statistic recap and explained how to do the post-hoc analysis with metafor, that was exactly was I was looking for!!
Hi @MaximilianPi and @florianhartig
Thank you so much for this amazing package!! It addresses such an important problem in the SDM field!
I have a basic question on how to work with categorical environmental variables when fitting sjSDM. I noticed that when I have categorical environmental variables, I get weights not for each environmental variable, but for each value of my environmental variable individually.
Here is a reproducible example to illustrate what I mean:
How would I do this correctly to get weights for each environmental variable similar to numerical environmental variables when working with categorical variables? Also, how would you recommend to incorporate time series data in your model currently?
Thank you again so much for your efforts and best wishes, Luzia