BlasBenito / spatialRF

R package to fit spatial models with Random Forest
https://blasbenito.github.io/spatialRF/

use of `case.weights` versus `class.weights` in the case of a binary response? #12

Open mikoontz opened 2 years ago

mikoontz commented 2 years ago

tl;dr

I noticed some unexpected behavior in the permutation importance when the response variable is binary and the random forest is fit as a regression. Variables that were highly important according to other "importance" metrics (e.g., mean minimum tree depth, number of times a variable is used as the root split, large differences in predicted value across a gradient of that variable, and the cross-validated importance value I get from spatialRF::rf_importance()) were showing up as strongly negative in the standard $variable.importance score.
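To make the sign of the score concrete: permutation importance is typically computed as the prediction error after shuffling one predictor minus the baseline error, so a negative value means predictions improved once the variable was scrambled. Below is a minimal base-R sketch of that idea. It uses a linear model and a held-out set as stand-ins for the forest and its out-of-bag observations. The data and the names `mse`, `perm_importance`, `x1`, and `x2` are hypothetical and not part of {ranger} or {spatialRF}.

```r
set.seed(42)

# Hypothetical data: x1 drives the response, x2 is pure noise
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n, sd = 0.5)

train <- d[1:100, ]
test <- d[101:200, ]

# Stand-in model (a linear model, not a random forest)
fit <- lm(y ~ x1 + x2, data = train)

mse <- function(obs, pred) mean((obs - pred)^2)
baseline <- mse(test$y, predict(fit, newdata = test))

# Importance = error after shuffling one predictor minus baseline error.
# A negative value means the shuffled predictor gave a lower error.
perm_importance <- function(var) {
  shuffled <- test
  shuffled[[var]] <- sample(shuffled[[var]])
  mse(test$y, predict(fit, newdata = shuffled)) - baseline
}

imp <- sapply(c("x1", "x2"), perm_importance)
print(imp)
```

In this toy setup the informative predictor should score well above the noise predictor, whose score hovers around zero and can dip slightly negative by chance. A strongly negative score for a variable that other metrics rank highly is what makes the behavior described above look suspicious.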

Some details

I built some {ranger} models directly to try to suss this out, and I think the behavior arises when a binary response is treated as a regression problem.

My (naive) understanding is that the class.weights argument of ranger() is the best way to account for class imbalance given a binary (or other categorical) response. I believe that the {spatialRF} machinery (e.g., using spatialRF::case_weights()) passes that information along to case.weights instead of class.weights.
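As far as the ranger() documentation describes them, case.weights holds one weight per observation, while class.weights holds one weight per outcome class, given in the order of the factor levels. The sketch below builds both from a toy response in base R. The inverse-class-frequency scheme is an assumption for illustration (it may not match what spatialRF::case_weights() actually computes), and `y`, `case_w`, `class_w`, and `first_seen_w` are hypothetical names.

```r
# Hypothetical binary response; the first observation belongs to class 1
y <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0)

# Assumed weighting scheme: each case is weighted by 1 / (size of its class)
class_counts <- table(y)
case_w <- as.numeric(1 / class_counts[as.character(y)])

# One weight per class, ordered by factor level ("0", then "1")
class_w <- tapply(case_w, as.factor(y), mean)

# unique() orders the weights by first appearance in the data instead
first_seen_w <- unique(case_w)

print(class_w)       # class "0": 0.125, class "1": 0.5
print(first_seen_w)  # 0.5, then 0.125
```

Because the first observation here is in class 1, unique() returns the weights in the reverse of factor-level order. Taking the per-class weight by level is therefore the safer way to build a class.weights vector from per-case weights.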

I am having a hard time understanding how case.weights and class.weights are used in ranger(). However, when I build a {ranger} model directly with a binary response and treat it as a classification problem (rather than regression), the permutation importance seems to track much better with the other measures of variable importance I listed above. That makes me suspect this is a fundamental issue that comes up when (inappropriately??) treating a binary response as a regression problem and using case.weights to try to account for class imbalance.
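One way to see what case.weights might be doing: the ranger() documentation describes them as weights for sampling training observations into each tree's bootstrap sample, whereas class.weights enter the splitting rule. The base-R sketch below imitates a single weighted bootstrap draw under assumed inverse-class-frequency weights and looks at the class mix in-bag and out-of-bag. It does not call ranger(), the data are hypothetical, and whether this accounts for the negative importance scores is not established here.

```r
set.seed(1)

# Hypothetical imbalanced binary response: 90 zeros and 10 ones
y <- rep(c(0, 1), times = c(90, 10))

# Assumed inverse-class-frequency weights, one per observation
class_counts <- table(y)
case_w <- as.numeric(1 / class_counts[as.character(y)])

# One weighted bootstrap sample, roughly what case weights do for each tree
in_bag <- sample(seq_along(y), size = length(y), replace = TRUE, prob = case_w)
oob <- setdiff(seq_along(y), in_bag)

prop_all <- mean(y == 1)             # share of ones in the full data
prop_in_bag <- mean(y[in_bag] == 1)  # share of ones drawn into the tree
prop_oob <- mean(y[oob] == 1)        # share of ones left out-of-bag

print(c(all = prop_all, in_bag = prop_in_bag, oob = prop_oob))
```

Under these weights each class carries half of the total sampling probability, so the in-bag sample is roughly balanced while the out-of-bag set consists almost entirely of the majority class. That may be relevant because permutation importance in ranger is evaluated on out-of-bag observations.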

Anyway, I'm still trying to read more to better understand the implications for building the model but I thought I'd flag it for now!

[edit: I'm pasting in some of my investigation code in case that's useful...]

library(spatialRF)
library(ranger)

plant_richness_df$response_binomial <- ifelse(
  plant_richness_df$richness_species_vascular > 5000,
  1,
  0
)

case.wgts <- spatialRF::case_weights(data = plant_richness_df, 
                                    dependent.variable.name = "response_binomial")

predictor.variable.names <- colnames(plant_richness_df)[5:21]

# Regression problem with binary response and using case.weights
fm1 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = plant_richness_df[["response_binomial"]], 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm1$variable.importance))

# Classification problem with a factor as response variable, and using case.weights
fm2 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = TRUE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm2$variable.importance))

# Probability estimation problem with a factor as response variable, and using case.weights
fm3 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm3$variable.importance))

# Probability estimation with a factor as response variable, and using class.weights
fm4 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
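                      # one weight per class, in factor-level order ("0", then "1")
                      class.weights = as.vector(tapply(case.wgts, plant_richness_df[["response_binomial"]], mean)),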
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm4$variable.importance))

# Probability estimation with a factor as response variable, and using both class.weights and case.weights
fm5 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]), 
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
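                      # one weight per class, in factor-level order ("0", then "1")
                      class.weights = as.vector(tapply(case.wgts, plant_richness_df[["response_binomial"]], mean)),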
                      importance = "permutation",
                      seed = 1)

as.data.frame(sort(fm5$variable.importance))

# Same binary (0/1) response fit through spatialRF::rf()
fm7 <- spatialRF::rf(data = plant_richness_df, 
                     dependent.variable.name = "response_binomial", 
                     predictor.variable.names = predictor.variable.names, 
                     seed = 1)

as.data.frame(sort(fm7$variable.importance)) # the {spatialRF} version creates the same model as fm1
as.data.frame(sort(fm1$variable.importance))