NorskRegnesentral / shapr

Explaining the output of machine learning models with more accurately estimated Shapley values
https://norskregnesentral.github.io/shapr/

Error during "explain" #354

Closed GuGuaTT closed 10 months ago

GuGuaTT commented 10 months ago

Hi! I got this error when using "explain":

Error in prediction(dt, prediction_zero, explainer) : nrow(explainer$x_test) == dt[, max(id)] is not TRUE

What does it mean?

martinju commented 10 months ago

Hi. It seems the prediction function does not provide the right dimension as output. Have you created a custom prediction function? To assist you further, please provide a minimal reproducible example with the failing code.
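For reference, here is a minimal sketch of what shapr expects from a custom prediction function (the class name my_custom_model and the x$fit slot are placeholders, not taken from your code): a predict_model method that returns a plain numeric vector with exactly one prediction per row of newdata, which is the length the failing check compares against.

# Placeholder sketch of a custom prediction function for shapr:
# it must return one numeric prediction per row of `newdata`.
predict_model.my_custom_model <- function(x, newdata) {
  # `x` is the fitted model object, `newdata` holds the rows to be predicted
  as.numeric(predict(x$fit, as.matrix(newdata)))
}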

GuGuaTT commented 10 months ago

Hello! Thank you for your prompt reply! My test code is below:

# 'all' is a (1494, 20) table, with the first 19 columns as features and the last column as the output values
all <- data.table::as.data.table(cbind(as.matrix(data), as.matrix(output)))
names <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
           "k", "l", "m", "n", "o", "p", "q", "r", "s", "t")
colnames(all) <- names

# Assign inputs and outputs
x_var <- names[1:19]
y_var <- names[20]
x_train <- as.matrix(all[ , ..x_var])
y_train <- all[ , get(y_var)]

# Fit a basic xgboost model
model <- xgboost(
  data = x_train,
  label = y_train,
  nrounds = 50,
  verbose = FALSE
)

# Visualize results
pred <- predict(model, x_train)

# Specify the expected prediction without any features and set up the explainer
p0 <- mean(y_train)
explainer <- shapr(x_train, model, n_combinations = 10000)

# Test with the first 10 training data
test <- x_train[1:10, ]

explanation <- explain(test, explainer = explainer,
                       approach = "empirical",
                       prediction_zero = p0
)

Then I got the error I mentioned.

GuGuaTT commented 10 months ago

Hello! Since I cannot solve the problem above, I have tested with your new R package, but a new problem occurred. I think we can focus on this problem instead. My code is:

data2 <- data.table::as.data.table(cbind(as.matrix(data2), as.matrix(output)))
names <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
           "k", "l", "m", "n", "o", "p", "q", "r", "s", "t")
colnames(data2) <- names

x_var2 <- names[1:9]
y_var2 <- names[20]

x_train2 <- data2[ , ..x_var2]
y_train2 <- data2[ , get(y_var2)]
x_explain2 <- x_train2[1:5, ]

# Fitting a basic xgboost model to the training data
model2 <- xgboost(
  data = as.matrix(x_train2),
  label = y_train2,
  nrounds = 100,
  verbose = FALSE
)

# Specifying the expected prediction without any features
p02 <- mean(y_train2)

explanation2 <- explain(
  model = model2,
  x_explain = as.matrix(x_explain2),
  x_train = as.matrix(x_train2),
  approach = "gaussian",
  prediction_zero = p02,
  n_combinations = NULL
)

print(explanation2$shapley_values)

The problem is: if I use only eight features to build the prediction model (x_var2 <- names[1:8]), the code runs fine. However, with nine or more features (x_var2 <- names[1:9]), it fails with the error,

Error in setnames(x, value) : Can't assign 15011 names to a 6 column data.table

The error occurs when I execute the explain call. Could you take a look? I am happy to provide the data if you want. (BTW, I have just read the paper related to this package and it is really wonderful, thank you!)

martinju commented 10 months ago

Hi!

I took a look now and reproduced your issue: the problem is that you are using the feature name "i". We should fix this, but for now a simple workaround is to avoid "i" as a feature name. The same applies to "w" if you ever increase the number of features in your model.
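For reference, a minimal sketch of that workaround applied to your code above (the replacement name "i_feat" is arbitrary):

# Rename the clashing column before fitting and explaining,
# so that no feature name collides with shapr's internal variables
data.table::setnames(x_train2, old = "i", new = "i_feat")
x_explain2 <- x_train2[1:5, ]

# Refit the model on the renamed features
model2 <- xgboost(
  data = as.matrix(x_train2),
  label = y_train2,
  nrounds = 100,
  verbose = FALSE
)

explanation2 <- explain(
  model = model2,
  x_explain = as.matrix(x_explain2),
  x_train = as.matrix(x_train2),
  approach = "gaussian",
  prediction_zero = p02,
  n_combinations = NULL
)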

Note to self: Introduce a check for protected feature names ("i", "w", "p_hat", "id", "id_combination", etc.) and temporarily transform the feature names if any of these appear as feature names.
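Something along these lines could work (a rough sketch only, not the package's actual implementation; the reserved-name list is taken from the note above, here applied to the x_train matrix from the examples):

# Sketch of a guard against reserved feature names
reserved <- c("i", "w", "p_hat", "id", "id_combination")
clashing <- intersect(colnames(x_train), reserved)
if (length(clashing) > 0) {
  stop("The feature name(s) ", paste(clashing, collapse = ", "),
       " are used internally by shapr; please rename these features.")
}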

GuGuaTT commented 10 months ago

Thank you! That fixed the problem!