ModelOriented / shapviz

SHAP Plots in R
https://modeloriented.github.io/shapviz/
GNU General Public License v2.0

shapviz::shapviz: Error in s[, nms, drop = FALSE] : incorrect number of dimensions #141

Closed sebsilas closed 1 month ago

sebsilas commented 1 month ago

When I run shapviz::shapviz, I get the error:

Error in s[, nms, drop = FALSE] : incorrect number of dimensions

I believe I am using the correct predictors for the model, which are being "retained" for me by recipes::bake.

I'm also removing NAs, just in case.

Any idea what is going on here?


syllable_mod <- bundle::unbundle(lyricassessr::syllable_classifier_bundle)

# The tidymodels prediction works:

  preds <- predict(syllable_mod, new_data = audio_features, type = "prob") %>%
    dplyr::rename_with(~stringr::str_remove_all(.x, ".pred_")) %>%
    tidyr::pivot_longer(dplyr::everything(),
                        names_to = "Syllable", values_to = "Probability") %>%
    dplyr::arrange(dplyr::desc(Probability))

  xgb_model <- parsnip::extract_fit_engine(syllable_mod)

# Make sure there are no NAs, just in case (1 is a meaningless placeholder for now)
  audio_features <- audio_features %>%
    dplyr::mutate(across(everything(), ~ ifelse(is.na(.x), 1, .x)))

  audio_features_prepped <- recipes::bake(
    lyricassessr::prepped_recipe,
    recipes::has_role("predictor"),
    new_data = audio_features) %>%
    dplyr::mutate(across(everything(), ~ ifelse(is.na(.x), 1, .x)))

# Make sure audio_features agree with the retained (prepped) features
  audio_features <- audio_features %>%
    dplyr::select(dplyr::all_of(names(audio_features_prepped)))

shap_values <- shapviz::shapviz(xgb_model, X_pred = as.matrix(audio_features_prepped), X = as.matrix(audio_features))

mayer79 commented 1 month ago

Which XGB version are you using?

sebsilas commented 1 month ago

1.7.7.1

mayer79 commented 1 month ago

I can't run your code, so it's all speculation.

You can check what

s <- predict(xgb_model, X_pred = as.matrix(audio_features_prepped), predcontrib = TRUE)

gives. The result s should be a list with one matrix per class, each of dimension (nrow(X_pred), ncol(X_pred) + 1).
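
A minimal sketch of that shape check in base R. The helper name check_contrib_shape and the synthetic inputs are made up for illustration; with a real model, s would come from something like predict(xgb_model, newdata = X_pred, predcontrib = TRUE).

```r
# Hypothetical helper (not part of shapviz or xgboost): checks that `s`
# is a list of matrices, each with nrow(X_pred) rows and
# ncol(X_pred) + 1 columns (the extra column being BIAS).
check_contrib_shape <- function(s, X_pred) {
  expected <- c(nrow(X_pred), ncol(X_pred) + 1L)
  is.list(s) && all(vapply(
    s,
    function(m) is.matrix(m) && identical(dim(m), expected),
    logical(1)
  ))
}

# Synthetic stand-ins, for illustration only (no real model involved)
X_pred <- matrix(0, nrow = 2, ncol = 4,
                 dimnames = list(NULL, c("f1", "f2", "f3", "f4")))
good <- replicate(3, matrix(0, nrow = 2, ncol = 5), simplify = FALSE)
bad  <- replicate(3, numeric(5), simplify = FALSE)  # vectors, not matrices

check_contrib_shape(good, X_pred)
check_contrib_shape(bad, X_pred)
```

The first call should return TRUE and the second FALSE, since plain vectors have no dimensions.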

mayer79 commented 1 month ago

This should work:

library(shapviz)
library(xgboost)

params <- list(objective = "multi:softprob", num_class = 3)
X_pred <- data.matrix(iris[, -5])
dtrain <- xgboost::xgb.DMatrix(X_pred, label = as.integer(iris[, 5]) - 1)
fit <- xgb.train(params = params, data = dtrain, nrounds = 10)

x <- shapviz(fit, X_pred = X_pred)
x
sv_importance(x)
sv_dependence(x$Class_3, v = colnames(iris[, -5]))
[images: two plot screenshots]

sebsilas commented 1 month ago

Yes: if I change X_pred to newdata in that predict() call, I get a list whose length equals the number of classes, with matrices of the dimensions you specified (the +1 being the BIAS column).

Actually, you can access the model in question (lyricassessr::syllable_classifier_bundle) via:

devtools::install_github('sebsilas/lyricassessr')

sebsilas commented 1 month ago

Your iris example does work for me. But I am fitting the model via tidymodels and then extracting it.

I also tried replicating this, which fits the xgboost model via tidymodels, as closely as I could, but I still have the issue.

mayer79 commented 1 month ago

Not sure...

library(shapviz)
packageVersion("shapviz") # ‘0.9.3’

audio_features <- bundle::unbundle(lyricassessr::vowel_metadata_with_audio_features_range_restricted_without_AEIOU)
# run your code
shap_values <- shapviz(xgb_model, X_pred = as.matrix(audio_features_prepped))
sv_importance(shap_values)

[image: sv_importance plot]

sebsilas commented 1 month ago

Hm, well, that can't be right: your example defines audio_features with bundle::unbundle, which is meant to unbundle a serialized model object (see here).

How are you defining xgb_model and audio_features_prepped in your example?

mayer79 commented 1 month ago

Both according to your code. But since your code lacks the definition of audio_features, I took the first object that would not produce an error.^^

This is with everything unnecessary removed (I think):

library(shapviz)
library(tidymodels)

syllable_mod <- bundle::unbundle(lyricassessr::syllable_classifier_bundle)
audio_features <- bundle::unbundle(lyricassessr::vowel_metadata_with_audio_features_range_restricted_without_AEIOU)

xgb_model <- parsnip::extract_fit_engine(syllable_mod)

audio_features_prepped <- recipes::bake(
  lyricassessr::prepped_recipe,
  recipes::has_role("predictor"),
  new_data = audio_features,
  composition = "matrix"
)

shap_values <- shapviz(xgb_model, X_pred = audio_features_prepped)
sv_importance(shap_values)

sebsilas commented 1 month ago

Right okay, I can reproduce this now.

I think I may have misunderstood what you can do with Shapley values. For some reason, I thought that for a new prediction you could work out which variables contribute most to that prediction. But actually, you can only do this with the full training dataset, right? I get the same error I originally reported when I update your example to, e.g., slice_sample one row from the training data. It only works with the full dataset... my bad!

mayer79 commented 1 month ago

If you give me "audio_features", I can probably make things run correctly. Currently, I am not 100% certain if XGBoost gets the right input. It was trained on 14 features, but we are passing 31.
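
One way to guard against such a mismatch is to subset the prediction matrix to exactly the features the model was trained on, in the same order. A rough sketch in base R: align_to_features is a hypothetical helper, and model_features is assumed to hold the training feature names (for this xgboost version they might be available as xgb_model$feature_names, but that is an assumption, not something checked here).

```r
# Hypothetical helper: keep only the columns the model expects, in the
# model's order, and fail loudly if any expected feature is missing.
align_to_features <- function(X, model_features) {
  missing_cols <- setdiff(model_features, colnames(X))
  if (length(missing_cols) > 0) {
    stop("Missing features: ", paste(missing_cols, collapse = ", "))
  }
  X[, model_features, drop = FALSE]
}

# Synthetic example: 3 columns available, the "model" uses 2 of them
X <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
X_aligned <- align_to_features(X, model_features = c("c", "a"))
colnames(X_aligned)
```

Using drop = FALSE keeps the result a matrix even when a single feature or a single row is selected.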

mayer79 commented 1 month ago

You can use any plot, e.g.:

sv <- shapviz(xgb_model, X_pred = audio_features_prepped[1:2, , drop = FALSE])
sv_waterfall(sv$Class_1, row_id = 1)

[image: sv_waterfall plot]

Actually, when I only pass a single row to X_pred, I get your error message. The reason is that XGBoost then returns the SHAP values as a list of vectors, instead of a list of matrices.
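
The mechanism can be reproduced in base R without any model. The sketch below uses a made-up contribution matrix; it illustrates why two-index subsetting fails on a vector and how a one-row matrix could be restored. It is not the fix that went into shapviz.

```r
nms <- c("f1", "f2")

# Made-up SHAP-like matrix: 2 rows, 2 features plus a BIAS column
s_mat <- matrix(1:6, nrow = 2, dimnames = list(NULL, c(nms, "BIAS")))

# Taking one row without drop = FALSE collapses the matrix to a named vector
s_vec <- s_mat[1, ]

# Two-index subsetting works on the matrix ...
ok <- s_mat[, nms, drop = FALSE]

# ... but fails on the vector, with the message from the original report
msg <- tryCatch(
  s_vec[, nms, drop = FALSE],
  error = function(e) conditionMessage(e)
)
msg

# Turning the vector back into a 1-row matrix makes the subsetting work again
s_fixed <- matrix(s_vec, nrow = 1, dimnames = list(NULL, names(s_vec)))
s_fixed[, nms, drop = FALSE]
```

Until a fix is available, passing at least two rows (as in the [1:2, , drop = FALSE] call above) avoids the collapsed shape.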

sebsilas commented 1 month ago

Ah, so this is the issue then: I am only ever using 1 row!

mayer79 commented 1 month ago

sebsilas commented 1 month ago

Ok, many thanks. Do let me know when you've pushed a fix and I'll be happy to test it.

I assume replicating the row will not affect the estimates in any biased way?

sebsilas commented 1 month ago

By the way, as a separate point, this link says that if you pass the original data structure to the X argument, you will get labelled classes, but I can't replicate that here (I get Class_1, Class_2 etc. as in your screenshot):

shap_values <- shapviz(xgb_model, X_pred = as.matrix(audio_features_prepped), X = as.matrix(audio_features))

sv_importance(shap_values)

Should I be able to get the proper class labels (and if so, how)?

mayer79 commented 1 month ago

XGB loses the class labels. But you can set them as names(shapviz_object) <- .... (like a list)

sebsilas commented 1 month ago

Hm ok, I hope this is a safe way of doing it (i.e., guaranteeing the correct assignment)!

outcome_classes <- audio_features$syllable %>% as.character() %>% unique()
names(shap_values) <- outcome_classes

mayer79 commented 1 month ago

It must be in the same order that tidymodels uses for the class assignment when creating the response for the XGBoost backend. I'll let you figure this out. It is not under my control.

mayer79 commented 1 month ago

I think it should be levels(train_y) (if factor), or levels(as.factor(train_y)).
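
A small base R illustration of why the order matters, using a made-up response vector. It assumes the class index follows the factor level order (as in the iris example, where the label is as.integer(y) - 1); whether tidymodels does exactly this has not been verified here.

```r
# Made-up response, for illustration only
train_y <- factor(c("pa", "ka", "ta", "ka"))

# Order of first appearance: what unique() on the character values returns
by_appearance <- unique(as.character(train_y))

# Factor level order: what integer class codes are based on
by_levels <- levels(train_y)

# 0-based class codes, as typically passed to XGBoost
codes <- as.integer(train_y) - 1L

by_appearance
by_levels
codes
```

Here the two orders differ, so naming the shapviz object with the unique() result would attach labels to the wrong classes, while levels() lines up with the integer codes.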