viv-analytics closed this issue 1 year ago
Could you please add a small, working example for the pre-shapviz part?
Setup Block
library(modeldata)
library(tidymodels)
library(tidyverse)
library(themis)
data("credit_data")
credit_data <- credit_data %>% drop_na()
Model Building Block
set.seed(1234)
split <- initial_split(credit_data, prop = 0.75, strata = "Status")
train <- training(split)
test <- testing(split)
set.seed(1234)
cv_folds <- vfold_cv(data = train, v = 5, strata = "Status")
classification_metrics <- metric_set(f_meas, mcc, roc_auc, pr_auc)
set.seed(1234)
ctrl_grid <-
control_grid(save_pred = TRUE, save_workflow = TRUE, allow_par = TRUE,
verbose = TRUE, parallel_over = "everything")
basic_rec <-
recipe(Status ~ ., data = train) %>%
step_impute_bag(Home, Marital, Job, Income, Assets, Debt) %>%
step_dummy(Home, Marital, Records, Job, one_hot = TRUE)
balanced_rec <-
basic_rec %>%
step_rose(Status, seed = 1234)
boost_tree_xgboost_spec <-
boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(),
min_n = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = tune()) %>%
set_engine('xgboost', importance = TRUE) %>%
set_mode('classification')
xgb_wflowset <-
workflow_set(preproc = list(basic = basic_rec, balanced = balanced_rec),
models = list(xgboost = boost_tree_xgboost_spec),
cross = FALSE)
xgb_wflowmap <-
workflow_map(object = xgb_wflowset,
fn = "tune_grid",
resamples = cv_folds,
grid = 10,
metrics = classification_metrics,
control = ctrl_grid,
seed = 1234)
Finalizing Block
workflow_results <-
xgb_wflowmap %>%
rank_results(rank_metric = "roc_auc") %>%
filter(.metric == "roc_auc") %>%
select(wflow_id, model, .config, metric = mean, rank) %>%
group_by(wflow_id) %>%
slice_min(rank, with_ties = FALSE) %>%
ungroup() %>%
arrange(rank)
workflow_results_id_best <-
workflow_results %>%
slice_min(rank, with_ties = FALSE) %>%
pull(wflow_id)
workflow_results_best <-
xgb_wflowmap %>%
extract_workflow_set_result(id = workflow_results_id_best) %>%
select_best(metric = "roc_auc")
set.seed(1234)
workflow_results_best_fit <-
xgb_wflowmap %>%
extract_workflow(workflow_results_id_best) %>%
finalize_workflow(workflow_results_best) %>%
last_fit(split = split, metrics = classification_metrics)
workflow_results_best_fit %>% collect_metrics()
workflow_results_best_fit %>% collect_predictions()
Thanks a lot for the example.
I changed some stuff (unrelated to the actual question) and made the full workflow a bit more compact.
To use shapviz for an XGBoost model fitted via Tidymodels, I used the logic from my blog post here: https://lorentzen.ch/index.php/2023/01/27/shap-xgboost-tidymodels-love/
The same approach works for LightGBM models. For other models, the workflow is very easy via {kernelshap}.
library(modeldata)
library(tidymodels)
library(tidyverse)
data("credit_data")
set.seed(1234)
split <- initial_split(credit_data, prop = 0.90, strata = "Status")
train <- training(split)
test <- testing(split)
cv_folds <- vfold_cv(data = train, v = 5, strata = "Status")
preprocessor <- recipe(Status ~ ., data = train) %>%
step_integer(Home, Marital, Records, Job)
prep(preprocessor, head(train))  # quick check that the recipe preps cleanly
specification <- boost_tree(
mode = "classification",
tree_depth = tune(),
trees = 10000,
learn_rate = tune(),
stop_iter = 20
) %>%
set_engine("xgboost", nthread = 4, validation = 0.2)
workflow_xgb <- workflow(preprocessor, spec = specification)
# ~1 minute
tuned <- tune_grid(
workflow_xgb,
resamples = cv_folds,
grid = expand.grid(tree_depth = 2:4, learn_rate = c(0.05, 0.1)),
metrics = metric_set(mn_log_loss, roc_auc)
)
# How to use the number of rounds found by early-stopping instead of 20% internal validation??
best_fit <- workflow_xgb %>%
finalize_workflow(select_best(tuned, metric = "mn_log_loss")) %>%
last_fit(split)
Now the SHAP part:
library(shapviz)
# Will explain THIS dataset later
set.seed(2)
small <- train[sample(nrow(train), 1000), ]
small_prep <- bake(
prep(preprocessor),
has_role("predictor"),
new_data = small,
composition = "matrix"
)
head(small_prep)
shap <- shapviz(extract_fit_engine(best_fit), X_pred = small_prep, X = small)
sv_importance(shap, show_numbers = TRUE)
sv_dependence(shap, v = c("Seniority", "Income", "Amount", "Home"))
Thanks a lot @mayer79! I very much appreciate your improvements to this demo as well as your explanations of how to apply shapviz.
Q1: Why did you choose a reduced sample of the training data in shapviz step 1 and as new_data in shapviz step 2?
Q2: Is there another way to get the data for shapviz steps 1 and 2 directly from the last_fit() object or the workflow_map() object (xgb_wflowmap)? Any ideas on this?
Q3: Furthermore, what if the target variable were not binary but multiclass? Would any adaptations be needed, especially in the shapviz part, to generate insights with the sv_* functions?
Hmm.
Q2: The core problem is that we need access to the original XGBoost model to be able to use the native TreeSHAP implementation, which is why the example extracts it via extract_fit_engine(). For any model except XGBoost or LightGBM, {kernelshap} does the trick without the long bake() logic.
Q3: For a multiclass target, shapviz() returns a list-like "mshapviz" object whose per-class "shapviz" objects can be accessed via object[[1]] etc. One-hot encoded columns can also be summed back into a single feature via shapviz(object, ..., collapse = list(...)).
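A minimal sketch of both Q3 points, under stated assumptions: the dummy column names (Home_owner etc.), the baked one-hot matrix X_pred_ohe, and the multiclass fit best_fit_mc are hypothetical, imagining a one-hot recipe like basic_rec from the first post.
# Hypothetical: sum SHAP values of one-hot encoded dummies back into a
# single "Home" feature via the collapse argument
shap_ohe <- shapviz(
  extract_fit_engine(best_fit),
  X_pred = X_pred_ohe,  # hypothetical baked one-hot matrix
  X = small,
  collapse = list(Home = c("Home_owner", "Home_rent", "Home_other"))
)
# Hypothetical multiclass fit: shapviz() returns an "mshapviz" object,
# a list of per-class "shapviz" objects
shap_mc <- shapviz(extract_fit_engine(best_fit_mc), X_pred = small_prep, X = small)
sv_importance(shap_mc)                        # importance for all classes
sv_dependence(shap_mc[[1]], v = "Seniority")  # dependence for the first class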
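And a hedged sketch of the {kernelshap} route for other model types; the logistic regression workflow and the background sample size are illustrative assumptions, not from the thread.
library(kernelshap)
# Hypothetical non-tree model: plain logistic regression on the same recipe
# (complete cases only, since glm does not handle missing values)
fit_glm <- workflow(preprocessor, logistic_reg()) %>% fit(drop_na(train))
# kernelshap() explains the whole workflow, so no bake() logic is needed
X_small <- drop_na(small) %>% select(-Status)  # predictor rows to explain
ks <- kernelshap(
  fit_glm,
  X = X_small,
  bg_X = X_small[1:200, ],  # background sample
  type = "prob"             # passed on to predict()
)
shap_glm <- shapviz(ks)     # "mshapviz": one "shapviz" per class probability
sv_importance(shap_glm)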
I will close the issue, but I really want to make a blog post or vignette on this!
Dear Authors,
I'm currently struggling with the shapviz explainer for tidymodels. I've checked the examples provided for the diamonds data, which use only fit() on the training data.
What would be your recommended best practices for visualizing the following two tidymodels objects:
1.) Results from workflow_map() using CV resamples and a tuning grid with an XGBoost model on the training data
2.) Results from last_fit() using the best model from 1.)
Thanks in advance for your suggestions.