bgreenwell / fastshap

Fast approximate Shapley values in R
https://bgreenwell.github.io/fastshap/

Can we use fastshap to explain isolation forest in R? #22

Closed by oabhitej 3 years ago

oabhitej commented 3 years ago

Currently, fastshap is only designed for supervised learning, as per the package description. Can we use this package to explain unsupervised learning algorithms like isolation forest? I see that isotree can already be used with the shapper package, and getting similar functionality in fastshap would be amazing.

bgreenwell commented 3 years ago

@oabhitej Yes, you can. Technically, you can use fastshap (or any other package that supports Shapley-like explanations, like iml or iBreakDown) to explain any model that can produce scores/predictions for new data. And while I haven't seen it used in this context, I think it makes perfect sense. Here's an example from an upcoming book I'm writing on tree-based methods for CRC Press; it uses fastshap to explain observations with high anomaly scores from an isolation forest fit to a well-known credit card fraud data set:

library(ggplot2)  # for the plot at the end
library(isotree)

ccfraud <- data.table::fread("../data/ccfraud.csv")  # https://www.kaggle.com/mlg-ulb/creditcardfraud

# Randomize the data
set.seed(2117)  # for reproducibility
ccfraud <- ccfraud[sample(nrow(ccfraud)), ]

# Split data into train/test sets
set.seed(2013)  # for reproducibility
trn.id <- sample(nrow(ccfraud), size = 10000, replace = FALSE)
ccfraud.trn <- ccfraud[trn.id, ]
ccfraud.tst <- ccfraud[-trn.id, ]

# Fit a default isolation forest
ifo <- isolation.forest(ccfraud.trn[, 1L:30L], random_seed = 2223, nthreads = 1)

# Compute anomaly scores for the test observations
head(scores <- predict(ifo, newdata = ccfraud.tst))

# Training set anomaly scores
scores.trn <- predict(ifo, newdata = ccfraud.trn)
to.explain <- max(scores) - mean(scores.trn)

max.id <- which.max(scores)  # row ID for observation with the highest anomaly score
max.x <- ccfraud.tst[max.id, ]
max(scores)
max.x  # observation to "explain" or compute feature contributions for

X <- ccfraud.trn[, 1L:30L]  # feature columns only
max.x <- max.x[, 1L:30L]  # feature columns only!
pfun <- function(object, newdata) {  # prediction wrapper
  predict(object, newdata = newdata)
}

# Generate feature contributions
set.seed(1351)  # for reproducibility
(ex <- fastshap::explain(ifo, X = X, newdata = max.x, pred_wrapper = pfun, 
                         adjust = TRUE, nsim = 1000))
sum(ex)  # should sum to f(x) - baseline whenever `adjust = TRUE` 

# Transpose feature contributions
res <- data.frame(
  "feature" = paste0(names(ex), "=", round(max.x, digits = 2)),
  "shapley.value" = as.numeric(as.vector(ex[1L,]))
)

# Plot feature contributions
ggplot(res, aes(x = shapley.value, y = reorder(feature, shapley.value))) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  xlab("Shapley value") +
  ylab("") +
  theme(axis.text.y = element_text(size = rel(0.8)))

bgreenwell commented 3 years ago

The interpretation of the output here is a bit moot since the feature names have been anonymized, but it illustrates the idea that feature contributions can be useful in explaining anomaly scores.
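
Since the feature names in the fraud data are anonymized, a toy case with known structure may make the idea easier to see. The following is a minimal base R sketch of permutation-sampling Shapley estimation applied to an arbitrary scoring function. It is not fastshap's implementation, and `score`, `shapley_mc`, and the simulated data are all hypothetical stand-ins invented for illustration; the only point is that the estimator needs nothing from the model except scores for new data. In this particular sketch the contributions sum exactly to f(x) minus the average score of the sampled background rows, because each pass changes one feature at a time along a single path, so no adjustment step is needed here.

```r
set.seed(101)

# Toy training data and a stand-in "anomaly score": squared distance from the
# training column means. It plays the role of predict(ifo, newdata = ...).
X <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
centre <- colMeans(X)
score <- function(newdata) rowSums(sweep(newdata, 2, centre)^2)

# Permutation-sampling estimate of the feature contributions for one row `x`
# (hypothetical helper, not part of fastshap).
shapley_mc <- function(score, X, x, nsim = 200) {
  p <- ncol(X)
  phi <- setNames(numeric(p), colnames(X))
  bg <- sample(nrow(X), nsim, replace = TRUE)  # background rows to start from
  for (i in seq_len(nsim)) {
    z <- X[bg[i], ]                     # random background observation
    prev <- score(matrix(z, nrow = 1))
    for (j in sample(p)) {              # random feature ordering
      z[j] <- x[j]                      # switch feature j to the explained value
      cur <- score(matrix(z, nrow = 1))
      phi[j] <- phi[j] + (cur - prev)   # marginal contribution of feature j
      prev <- cur
    }
  }
  list(phi = phi / nsim, baseline = mean(score(X[bg, , drop = FALSE])))
}

x <- c(a = 4, b = 0, c = -3)            # an obviously unusual observation
out <- shapley_mc(score, X, x)
fx <- score(matrix(x, nrow = 1))
out$phi                                  # per-feature contributions
all.equal(sum(out$phi), fx - out$baseline)  # contributions sum to f(x) - baseline
```

Features `a` and `c` sit far from the training means, so they should receive the largest contributions, while `b` should contribute little. Swapping `score` for a wrapper around a fitted isolation forest's `predict()` would follow the same pattern, at the cost of many more model evaluations.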

bgreenwell commented 3 years ago

Attachment: rf-fraud-detection-iforest-shapley-plot-1.pdf

oabhitej commented 3 years ago

Thank you, @bgreenwell. I know it is a lot to ask, but do you plan to add a TreeSHAP implementation to the fastshap package in a future release?

bgreenwell commented 3 years ago

A generic implementation is not on the roadmap, but fastshap does support TreeSHAP for xgboost and lightgbm models.

bgreenwell commented 3 years ago

I suspect you can use TreeSHAP with sklearn’s isolation forest. Wouldn’t be hard to wrap all of that in R using reticulate.