Closed sebsilas closed 1 month ago
Which XGB version are you using?
1.7.7.1
I can't run your code, so it's all speculation.
You can check what

s <- predict(xgb_model, X_pred = as.matrix(audio_features_prepped), predcontrib = TRUE)

gives. The result s should be a list with num_class matrices of dimension (nrow(X_pred), ncol(X_pred) + 1).
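As a self-contained stand-in check (using iris instead of the audio features, which aren't available here), the shape of the predcontrib output can be inspected like this:

```r
library(xgboost)

# Minimal multiclass fit on iris as a stand-in for the real model
X <- data.matrix(iris[, -5])
dtrain <- xgb.DMatrix(X, label = as.integer(iris$Species) - 1)
fit <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3),
  data = dtrain,
  nrounds = 5
)

s <- predict(fit, newdata = X, predcontrib = TRUE)
str(s, max.level = 1)
# In xgboost 1.7.x: a list of 3 matrices, each 150 x 5 (4 features + BIAS)
```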
This should work:
library(shapviz)
library(xgboost)
params <- list(objective = "multi:softprob", num_class = 3)
X_pred <- data.matrix(iris[, -5])
dtrain <- xgboost::xgb.DMatrix(X_pred, label = as.integer(iris[, 5]) - 1)
fit <- xgb.train(params = params, data = dtrain, nrounds = 10)
x <- shapviz(fit, X_pred = X_pred)
x
sv_importance(x)
sv_dependence(x$Class_3, v = colnames(iris[, -5]))
Yes (if I change X_pred to newdata), I get a list with length == number of classes and matrices of the dimensions you specified (the +1 being the BIAS column).
Actually, you can access the model in question (lyricassessr::syllable_classifier_bundle) via:
devtools::install_github('sebsilas/lyricassessr')
This does work for me. But I am fitting the model via tidymodels and then extracting it. I tried replicating this (which fits the xgboost model via tidymodels) as closely as I could, but I still have an issue.
Not sure...
library(shapviz)
packageVersion("shapviz") # ‘0.9.3’
audio_features <- bundle::unbundle(lyricassessr::vowel_metadata_with_audio_features_range_restricted_without_AEIOU)
# run your code
shap_values <- shapviz(xgb_model, X_pred = as.matrix(audio_features_prepped))
sv_importance(shap_values)
Hm, well that can't be right. bundle::unbundle is meant to unbundle a serialized model object (see here). How are you defining xgb_model and audio_features_prepped in your example?
Both according to your code. But since your code lacks the definition of audio_features, I took the first object that would not produce an error. ^^
This is with everything unnecessary removed (I think):
library(shapviz)
library(tidymodels)
syllable_mod <- bundle::unbundle(lyricassessr::syllable_classifier_bundle)
audio_features <- bundle::unbundle(lyricassessr::vowel_metadata_with_audio_features_range_restricted_without_AEIOU)
xgb_model <- parsnip::extract_fit_engine(syllable_mod)
audio_features_prepped <- recipes::bake(
  lyricassessr::prepped_recipe,
  recipes::has_role("predictor"),
  new_data = audio_features,
  composition = "matrix"
)
shap_values <- shapviz(xgb_model, X_pred = audio_features_prepped)
sv_importance(shap_values)
Right, okay, I can reproduce this now.
I think I have maybe misunderstood what you can do with Shapley values. For some reason, I thought that for a new prediction, you could work out which variables are contributing most to that prediction. But actually, you can only do this with the full training dataset, right? I get the same error I originally reported when I update your example to, e.g., slice_sample one row off the training data. It only works with the full dataset... my bad!
If you give me "audio_features", I can probably make things run correctly. Currently, I am not 100% certain that XGBoost gets the right input: it was trained on 14 features, but we are passing 31.
You can use any plot, e.g.:
sv <- shapviz(xgb_model, X_pred = audio_features_prepped[1:2, , drop = FALSE])
sv_waterfall(sv$Class_1, row_id = 1)
Actually, when I pass only a single row to X_pred, I get your error message. The reason is that XGBoost then returns the SHAP values as a list of vectors instead of a list of matrices.
Ah, so this is the issue then: I am only ever using one row!
OK, many thanks. Do let me know when you've pushed a fix and I'll be happy to test it.
I assume replicating the row will not affect the estimates in any biased way?
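For what it's worth, TreeSHAP attributions are computed independently per row, so duplicating a row should not bias its values; it just coaxes XGBoost into returning matrices instead of vectors. A hedged sketch of that workaround on iris (as a stand-in for the real model), until a fix lands:

```r
library(shapviz)
library(xgboost)

X <- data.matrix(iris[, -5])
dtrain <- xgb.DMatrix(X, label = as.integer(iris$Species) - 1)
fit <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3),
  data = dtrain,
  nrounds = 5
)

# Duplicate the single prediction row so XGBoost returns matrices,
# not vectors; both rows then carry identical SHAP values.
row1 <- X[1, , drop = FALSE]
sv <- shapviz(fit, X_pred = rbind(row1, row1))
sv_waterfall(sv$Class_1, row_id = 1)
```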
By the way, as a separate point, this link says that if you pass the original data structure to the X argument, you will get labelled classes, but I can't replicate that here (I get Class_1, Class_2, etc., as in your screenshot):
shap_values <- shapviz(xgb_model, X_pred = as.matrix(audio_features_prepped), X = as.matrix(audio_features))
sv_importance(shap_values)
Should I be able to get the proper class labels (and if so, how)?
XGB loses the class labels. But you can set them via names(shapviz_object) <- ... (like a list).
Hm, ok, I hope this is a safe way of doing it (i.e., that it guarantees the correct assignment)!
outcome_classes <- audio_features$syllable %>% as.character() %>% unique()
names(shap_values) <- outcome_classes
It must be in the same order as tidymodels assigns the classes when creating the response for the XGBoost backend. I'll let you figure this out; it is not under my control.
I think it should be levels(train_y) (if a factor), or levels(as.factor(train_y)).
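Putting that together as a hedged sketch (assuming audio_features$syllable is the response column that tidymodels encoded for the XGBoost backend): unique() returns labels in order of first appearance, whereas XGBoost's 0-based class coding follows the factor level order, so levels() is the safer choice:

```r
# unique() gives appearance order; levels() gives the coding order
# that xgboost's 0-based integer labels follow:
y <- factor(c("b", "a", "b"))
unique(as.character(y))  # "b" "a"  (appearance order)
levels(y)                # "a" "b"  (coding order)

# So for the model in this thread (names are from the example above):
# names(shap_values) <- levels(as.factor(audio_features$syllable))
```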
When I run shapviz::shapviz, I get the error:

I believe I am using the correct predictors for the model, which are being "retained" for me by recipes::bake. I'm also removing NAs, just in case.

Any idea what is going on here?