ModelOriented / shapviz

SHAP Plots in R
https://modeloriented.github.io/shapviz/
GNU General Public License v2.0

Best practice for visualizing tidymodels last_fit() object #97

Closed · viv-analytics closed 1 year ago

viv-analytics commented 1 year ago

Dear Authors,

I'm currently struggling with the shapviz explainer for tidymodels. I've checked the examples provided for the diamonds dataset, which use only fit() on the training data.

What would be your recommended best practices for visualizing the following two tidymodels objects:

1.) Results from workflow_map() with CV resamples and a tuning grid, using an xgboost model on the training data

2.) Results from last_fit() using the best model from 1.)

Thanks in advance for your suggestions.

mayer79 commented 1 year ago

Could you please add a small, working example for the pre-shapviz part?

viv-analytics commented 1 year ago

Setup Block

library(modeldata)
library(tidymodels)
library(tidyverse)
library(themis)

data("credit_data")

credit_data <- credit_data %>% drop_na()

Model Building Block

set.seed(1234)
split <- initial_split(credit_data, prop = 0.75, strata = "Status")

train <- training(split)
test <- testing(split)

set.seed(1234)
cv_folds <- vfold_cv(data = train, v = 5, strata = "Status")

classification_metrics <- metric_set(f_meas, mcc, roc_auc, pr_auc)

set.seed(1234)
ctrl_grid <- 
  control_grid(save_pred = TRUE, save_workflow = TRUE, allow_par = TRUE,
               verbose = TRUE, parallel_over = "everything")

basic_rec <- 
  recipe(Status ~ ., data = train) %>%
  step_impute_bag(Home, Marital, Job, Income, Assets, Debt) %>% 
  step_dummy(Home, Marital, Records, Job, one_hot = TRUE)

balanced_rec <- 
  basic_rec %>% 
  step_rose(Status, seed = 1234)

boost_tree_xgboost_spec <-
  boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(),
             min_n = tune(), loss_reduction = tune(), sample_size = tune(),
             stop_iter = tune()) %>%
  set_engine('xgboost', importance = TRUE) %>%
  set_mode('classification')

xgb_wflowset <- 
  workflow_set(preproc = list(basic = basic_rec, balanced = balanced_rec),
               models = list(xgboost = boost_tree_xgboost_spec),
               cross = FALSE)

xgb_wflowmap <- 
  workflow_map(object = xgb_wflowset,
               fn = "tune_grid",
               resamples = cv_folds,
               grid = 10,
               metrics = classification_metrics,
               control = ctrl_grid,
               seed = 1234)

Finalizing Block

workflow_results <- 
    xgb_wflowmap %>% 
    rank_results(rank_metric = "roc_auc") %>% 
    filter(.metric == "roc_auc") %>% 
    select(wflow_id, model, .config, metric = mean, rank) %>% 
    group_by(wflow_id) %>% 
    slice_min(rank, with_ties = FALSE) %>% 
    ungroup() %>% 
    arrange(rank)

workflow_results_id_best <- 
    workflow_results %>% 
    slice_min(rank, with_ties = FALSE) %>% 
    pull(wflow_id)

workflow_results_best <-
    xgb_wflowmap %>% 
    extract_workflow_set_result(id = workflow_results_id_best) %>% 
    select_best(metric = "roc_auc")

set.seed(1234)
workflow_results_best_fit <-
    xgb_wflowmap %>% 
    extract_workflow(workflow_results_id_best) %>% 
    finalize_workflow(workflow_results_best) %>% 
    last_fit(split = split, metrics = classification_metrics)

workflow_results_best_fit %>% collect_metrics()
workflow_results_best_fit %>% collect_predictions()

mayer79 commented 1 year ago

Thanks a lot for the example.

I changed some stuff (unrelated to the actual question) and made the full workflow a bit more compact.

  1. Switched to a consistent scoring function for probabilistic classification -> log loss
  2. Removed the oversampling step (useful only for deep learning stuff with small batch size; otherwise, choosing a good scoring function does the trick)
  3. Replaced OHE with integer encoding (could also try XGBoost's new categorical encodings)
  4. Improved the tuning strategy. It does not make sense to select the number of boosting rounds both by grid search and by early stopping; one of the two suffices.

To use shapviz for an XGBoost model fitted via tidymodels, I used the logic from my blog post: https://lorentzen.ch/index.php/2023/01/27/shap-xgboost-tidymodels-love/

The same approach works for LightGBM models. For any other model, the workflow is very easy via {kernelshap}; a sketch follows below.
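For illustration, a minimal, untested {kernelshap} sketch (not from the original thread; the object names, sample sizes, and the .pred_bad probability column are assumptions based on the credit_data example in this issue):

library(kernelshap)
library(shapviz)

# Fitted workflow from the last_fit() result of the first post (assumed name)
fitted_wf <- extract_workflow(workflow_results_best_fit)

# Explain a subsample on the original feature scale; no bake() logic needed.
# KernelSHAP is model-agnostic but slower than TreeSHAP.
X_explain <- train[sample(nrow(train), 500), setdiff(names(train), "Status")]
bg <- train[sample(nrow(train), 200), setdiff(names(train), "Status")]

ks <- kernelshap(
  fitted_wf,
  X = X_explain,
  bg_X = bg,
  pred_fun = function(m, X) predict(m, X, type = "prob")$.pred_bad
)
shap_ks <- shapviz(ks)
sv_importance(shap_ks)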

library(modeldata)
library(tidymodels)
library(tidyverse)

data("credit_data")

set.seed(1234)
split <- initial_split(credit_data, prop = 0.90, strata = "Status")

train <- training(split)
test <- testing(split)
cv_folds <- vfold_cv(data = train, v = 5, strata = "Status")

preprocessor <- recipe(Status ~ ., data = train) %>%
  step_integer(Home, Marital, Records, Job)

# Quick sanity check of the preprocessing on a few rows
prep(preprocessor, head(train))

specification <- boost_tree(
    mode = "classification",
    tree_depth = tune(),
    trees = 10000,       # generous upper bound; early stopping picks the actual number
    learn_rate = tune(),
    stop_iter = 20       # early-stopping patience
  ) %>%
  set_engine("xgboost", nthread = 4, validation = 0.2)  # 20% internal validation for early stopping

workflow_xgb <- workflow(preprocessor, spec = specification)

# ~1 minute
tuned <- tune_grid(
  workflow_xgb,
  resamples = cv_folds,
  grid = expand.grid(tree_depth = 2:4, learn_rate = c(0.05, 0.1)),
  metrics = metric_set(mn_log_loss, roc_auc)
)

# How to use the number of rounds found by early-stopping instead of 20% internal validation??
best_fit <- workflow_xgb %>%
  finalize_workflow(select_best(tuned, metric = "mn_log_loss")) %>%
  last_fit(split)

Now the SHAP part:

library(shapviz)

# Will explain THIS dataset later
set.seed(2)
small <- train[sample(nrow(train), 1000), ]
small_prep <- bake(
  prep(preprocessor), 
  has_role("predictor"),
  new_data = small, 
  composition = "matrix"
)
head(small_prep)

shap <- shapviz(extract_fit_engine(best_fit), X_pred = small_prep, X = small)

sv_importance(shap, show_numbers = TRUE)
sv_dependence(shap, v = c("Seniority", "Income", "Amount", "Home"))

[Plot: SHAP importance bar chart with numbers]

[Plot: SHAP dependence plots for Seniority, Income, Amount, and Home]

viv-analytics commented 1 year ago

Thanks a lot @mayer79, I very much appreciate your improvements to this demo dataset as well as your explanations on applying shapviz.

  1. shapviz-step-1: small <- train[sample(nrow(train), 1000), ]
  2. shapviz-step-2: small_prep <- bake(prep(basic_rec), has_role("predictor"), new_data = small, composition = "matrix")
  3. shapviz-step-3: shap <- shapviz(extract_fit_engine(workflow_results_best_fit), X_pred = small_prep, X = small)

Q1: Why do you choose a reduced sample of the training data in shapviz-step-1 and use it as new_data in shapviz-step-2?

Q2: I wonder if there is another way to get the data for shapviz-step-1 and -step-2 directly from the last_fit() object or the workflow_map object (xgb_wflowmap)? Any ideas on this?

Q3: Furthermore, what if the target variable were not binary but multiclass? Would any adaptations be needed, especially for the shapviz part, in order to generate insights using the sv_* functions?

mayer79 commented 1 year ago

Hmm.

  1. Q1: Usually, the training data is too big to visualize in scatter plots, so it is normal to subsample 200-2000 rows. (And it can save some time.)
  2. Q2: I am pretty sure this can be simplified, but not altogether. The current solution also requires that column transforms are 1:1. If you need to use OHE etc. for some special reason, you would need to map the SHAP values of the dummies back to the original column via shapviz(..., collapse = list(...)); see the sketch after this list. The core problem is that we need access to the original XGBoost model to be able to use the native TreeSHAP implementation. For any model except XGBoost or LightGBM, {kernelshap} does the trick without the long bake() logic.
  3. Q3: This should work. You will get a figure per output dimension (class), so the dependence plots would need to be called separately per class. And you can extract each class via object[[1]] etc.
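To make points 2 and 3 concrete, here is a rough, untested sketch (assuming the objects from the posts above; extract_recipe() on the last_fit() result and the grep()-based dummy matching are assumptions, and the multiclass lines are placeholders):

library(shapviz)

# Q2: the last_fit() result already carries the estimated recipe, so the
# explicit prep() step can be skipped
prepped <- extract_recipe(workflow_results_best_fit)
small_prep <- bake(prepped, has_role("predictor"), new_data = small, composition = "matrix")

# basic_rec one-hot encodes Home, so the SHAP values of its dummies are
# summed back into a single "Home" column
shap <- shapviz(
  extract_fit_engine(workflow_results_best_fit),
  X_pred = small_prep,
  X = small,
  collapse = list(Home = grep("^Home_", colnames(small_prep), value = TRUE))
)
sv_importance(shap, show_numbers = TRUE)

# Q3: for a multiclass XGBoost model, shapviz() returns an "mshapviz" object;
# extract one class and plot it separately (placeholder names):
# shap_mc <- shapviz(multiclass_booster, X_pred = X_mat, X = X_df)
# sv_dependence(shap_mc[[1]], v = "Income")  # first class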

mayer79 commented 1 year ago

I will close the issue, but really want to make a blog post or vignette on this!