I noticed some unexpected behavior of the permutation importance when a binary response variable is modeled with a regression random forest. Variables that were highly important by every other "importance" metric I looked at (e.g., mean minimum tree depth, large differences in predicted value across a gradient of the variable, number of times used as a root, and the cross-validated importance from spatialRF::rf_importance()) showed up as strongly negative in the standard $variable.importance score.
Some details
I built some {ranger} models directly to try to suss this out, and I think I've identified that the issue arises when a binary response is treated as a regression problem.
My (naive) understanding is that the class.weights argument of ranger() is the best way to account for class imbalance given a binary (or other categorical) response. I believe that the {spatialRF} machinery (e.g., using spatialRF::case_weights()) passes that information along to case.weights instead of class.weights.
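To make the distinction between the two arguments concrete, here's my understanding of the two weight formats, sketched for an imbalanced binary response. The inverse-frequency weighting is illustrative only; I'm not claiming it is the exact scheme spatialRF::case_weights() uses internally.

```r
# Illustrative sketch (assumed inverse-frequency weighting, not necessarily
# what spatialRF::case_weights() computes):
y <- c(rep(0, 90), rep(1, 10))  # imbalanced 0/1 response

# class.weights: one weight per class (in factor-level order), used by
# ranger() in the splitting rule
class.wgts <- 1 / table(y)

# case.weights: one weight per observation, used by ranger() when
# bootstrap-sampling the training data
case.wgts <- as.numeric(class.wgts[as.character(y)])

length(class.wgts)  # 2 (one per class)
length(case.wgts)   # 100 (one per row)
```

Under this scheme each class contributes equal total weight, which is the behavior I'd expect either argument to approximate.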
I'm having a hard time understanding exactly how case.weights and class.weights are used inside ranger(). However, when I build a {ranger} model directly with a binary response and treat it as a classification problem (rather than regression), the permutation importance tracks much better with the other measures of variable importance listed above. That makes me suspect this is a fundamental issue that comes up when (inappropriately?) treating a binary response as a regression problem and using case.weights to account for class imbalance.
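For context on how the score can even go negative: permutation importance is the increase in out-of-bag error after shuffling one predictor, so it drops below zero whenever shuffling a variable happens to *reduce* the error. A minimal, model-agnostic sketch of the regression version (ranger computes this internally per tree on out-of-bag data; perm_importance and predict_fun here are hypothetical names for illustration):

```r
# Minimal sketch of regression-style permutation importance.
# predict_fun: function taking a data.frame of predictors, returning predictions
# X, y: predictors and response; var: name of the column to permute
perm_importance <- function(predict_fun, X, y, var, seed = 1) {
  set.seed(seed)
  mse_base <- mean((predict_fun(X) - y)^2)
  X_perm <- X
  X_perm[[var]] <- sample(X_perm[[var]])  # break this variable's link to y
  mse_perm <- mean((predict_fun(X_perm) - y)^2)
  mse_perm - mse_base  # negative => permuting this variable *lowered* the error
}
```

For a genuinely informative variable this difference should be positive, which is why the strongly negative values in $variable.importance surprised me.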
Anyway, I'm still trying to read more to better understand the implications for building the model but I thought I'd flag it for now!
[edit: I'm pasting in some of my investigation code in case that's useful...]
library(spatialRF)
library(ranger)
plant_richness_df$response_binomial <- ifelse(
  plant_richness_df$richness_species_vascular > 5000,
  1,
  0
)
case.wgts <- spatialRF::case_weights(
  data = plant_richness_df,
  dependent.variable.name = "response_binomial"
)
predictor.variable.names <- colnames(plant_richness_df)[5:21]
# Regression problem with binary response and using case.weights
fm1 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = plant_richness_df[["response_binomial"]],
  classification = FALSE,
  probability = FALSE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm1$variable.importance))
# Classification problem with a factor as response variable, and using case.weights
fm2 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  classification = TRUE,
  probability = FALSE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm2$variable.importance))
# Probability estimation problem with a factor as response variable, and using case.weights
fm3 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  classification = FALSE,
  probability = TRUE,
  case.weights = case.wgts,
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm3$variable.importance))
# Probability estimation with a factor as response variable, and using class.weights
fm4 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  classification = FALSE,
  probability = TRUE,
  # class.weights expects one weight per factor level, in level order;
  # unique() returns weights in order of appearance in the data, so this
  # assumes the two orders happen to match
  class.weights = unique(case.wgts),
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm4$variable.importance))
# Probability estimation with a factor as response variable, and using both class.weights and case.weights
fm5 <- ranger::ranger(
  x = plant_richness_df[, predictor.variable.names],
  y = as.factor(plant_richness_df[["response_binomial"]]),
  classification = FALSE,
  probability = TRUE,
  case.weights = case.wgts,
  class.weights = unique(case.wgts),
  importance = "permutation",
  seed = 1
)
as.data.frame(sort(fm5$variable.importance))
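Rather than eyeballing the sorted data frames above, the agreement between two models' importance vectors can be summarized with a rank correlation. rank_agreement is a hypothetical helper name; the call at the end assumes the fm1 and fm2 models fit above exist in the session.

```r
# Spearman rank correlation between two named importance vectors,
# matched by variable name
rank_agreement <- function(a, b) {
  vars <- intersect(names(a), names(b))
  cor(a[vars], b[vars], method = "spearman")
}

# e.g., regression vs. classification treatment of the same response:
# rank_agreement(fm1$variable.importance, fm2$variable.importance)
```

A value near 1 would mean the two treatments rank the predictors similarly; the divergence I'm describing shows up as a much lower (or negative) correlation.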
# spatialRF
fm7 <- spatialRF::rf(
  data = plant_richness_df,
  dependent.variable.name = "response_binomial",
  predictor.variable.names = predictor.variable.names,
  seed = 1
)
as.data.frame(sort(fm7$variable.importance)) # the {spatialRF} version creates the same model as fm1
as.data.frame(sort(fm1$variable.importance))