NorskRegnesentral / shapr

Explaining the output of machine learning models with more accurately estimated Shapley values
https://norskregnesentral.github.io/shapr/

Machine Learning Model for Mixed Data using VAEAC Approach #386

Closed AbdollahiAz closed 7 months ago

AbdollahiAz commented 8 months ago

Dear shapr,

Inspired by issue #385, I used the airquality dataset to implement the vaeac approach, where the feature "Month" is treated as categorical using as.factor(Month) (is that correct?). I fitted 3 machine learning (ML) models: xgboost, lm, and ranger. For xgboost, I got the following error:

Error in xgboost(data = as.matrix(x_train), label = y_train, nround = 100,  : 
  could not find function "xgboost"

I calculated the R2 error metric as follows: lm: R2 = 0.6481237; ranger: R2 = 0.1001726. As you can see in the attached beeswarm plots, the computed Shapley values are completely different.

My questions for the mixed dataset are as follows:

1. Which machine learning algorithm should be used?
2. Which configuration should be applied? (Case 1: ranger + vaeac + categorical; Case 2: lm + vaeac + categorical; Case 3: xgboost + vaeac + categorical)

Based on the paper "Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features", I expected ranger to work well. However, I am a bit unsure how I can trust the Shapley values when the fitted model is biased. Please help, @LHBO. The code is as follows:

library(ranger)
library(shapr)
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

# convert the month variable to a factor
data[, Month_factor := as.factor(Month)]

x_var_cat <- c("Solar.R", "Wind", "Temp", "Month_factor")
y_var <- "Ozone"

ind_x_explain <- 1:20
data_train_cat <- data[-ind_x_explain, ]
x_train_cat <- data_train_cat[, ..x_var_cat]
x_explain_cat <- data[ind_x_explain, ][, ..x_var_cat]

# Fit a random forest model to the training data
##Case 1
#model <- ranger(as.formula(paste0(y_var, " ~ ", paste0(x_var_cat, collapse = " + "))),
#                data = data_train_cat
#)
# predictions <- predict(model, data = x_explain_cat)$predictions
# actual <- data[ind_x_explain, get(y_var)]
# lm_model <- lm(predictions ~ actual)
# r2 <- summary(lm_model)$r.squared

##Case 2
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var_cat, collapse = " + ")))
model <- lm(lm_formula, data = data_train_cat)

predictions <- predict(model, newdata = x_explain_cat)
r2 <- summary(model)$r.squared # note: this is the in-sample R2 of the lm fit

##Case 3
# Note: requires library(xgboost). Also, x_train and y_train are not
# defined above (the training data is x_train_cat), and xgboost only
# accepts numeric matrices, so the factor column must be encoded first.
# model <- xgboost(
#   data = as.matrix(x_train),
#   label = y_train,
#   nrounds = 100,
#   verbose = FALSE
# )

# Specifying the phi_0, i.e. the expected prediction without any features
prediction_zero <- mean(data_train_cat[, get(y_var)])

# Then we use the vaeac approach
expl_vaeac_with <- explain(
  model = model,
  x_explain = x_explain_cat,
  x_train = x_train_cat,
  approach = "vaeac",
  prediction_zero = prediction_zero,
  n_batches = 1,
  n_samples = 250,
  vaeac.epochs = 50,
  vaeac.n_vaeacs_initialize = 4
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(expl_vaeac_with, plot_type = "scatter")
  plot(expl_vaeac_with, plot_type = "beeswarm")
}

Sincerely, Azam

LHBO commented 7 months ago

Hi.

`could not find function "xgboost"` means that you most likely have not installed (or not loaded) the xgboost package. Have you installed it? If not, then do so.
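For what it is worth, here is a minimal sketch of how Case 3 could be made to run once xgboost is installed. Note that xgboost only accepts numeric matrices, so `Month_factor` must be encoded first; `x_train_cat`, `data_train_cat`, and `y_var` are the objects from your code, while the encoding step is my own suggestion:

```r
# install.packages("xgboost")  # run once if the package is missing
library(xgboost)

# xgboost needs a numeric matrix, so one-hot encode the factor column
x_train_num <- model.matrix(~ . - 1, data = x_train_cat)
y_train <- data_train_cat[[y_var]]

model_xgb <- xgboost(
  data = x_train_num,
  label = y_train,
  nrounds = 100, # note: the argument is `nrounds`, not `nround`
  verbose = FALSE
)
```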

I am surprised by the low R2 value for ranger compared to lm, but what is your justification for using R2 rather than, e.g., MSE? Have you made a mistake in Case 1? You call your model `model`, but when you compute the R2, you fit a new `lm_model` on `predictions ~ actual`. Is that intentional?
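Also, regressing predictions on actual values (as in your Case 1) gives the squared correlation, which is not the same as out-of-sample R2, and your Case 2 uses the in-sample R2 of the lm fit. For a consistent comparison, I would compute both MSE and out-of-sample R2 on the same held-out predictions for each model. A sketch, where `pred_ranger` and `pred_lm` are assumed to be the held-out predictions from your two fitted models:

```r
# Sketch: evaluate both models on the same held-out observations
actual <- data[ind_x_explain, get(y_var)]

eval_model <- function(pred, actual) {
  mse <- mean((actual - pred)^2)
  r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
  c(MSE = mse, R2 = r2)
}

rbind(
  ranger = eval_model(pred_ranger, actual),
  lm     = eval_model(pred_lm, actual)
)
```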

I would argue that the overall trends in your figures resemble each other. However, you have to remember that you are explaining two different models, so obtaining identical explanations would be strange.

Lars

AbdollahiAz commented 7 months ago

Dear @LHBO,

Thanks for your explanation. I addressed the typo and installed xgboost. It worked.

Sincerely, Az