bgreenwell / fastshap

Fast approximate Shapley values in R
https://bgreenwell.github.io/fastshap/

Approximate method with glm model returns zeros. #45

esther-meerwijk closed 2 years ago

esther-meerwijk commented 2 years ago

I've been perusing various sites that describe how to determine approximate values with fastshap for a binomial glm model, but so far have been unsuccessful in making it work. Here's what I have been using:

x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)

df <- data.frame(x1, x2, x3, x4, y)

fit <- glm(y ~ ., data=df, family=binomial)
X <- model.matrix(y ~., df)[,-1]

pfun <- function(object, newdata) {
  predict(object, type="response")
}

shap <- explain(fit , X = X, pred_wrapper = pfun, nsim = 10)

Here's the result:

> summary(shap)
       x1          x2          x3          x4   
 Min.   :0   Min.   :0   Min.   :0   Min.   :0  
 1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0  
 Median :0   Median :0   Median :0   Median :0  
 Mean   :0   Mean   :0   Mean   :0   Mean   :0  
 3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
 Max.   :0   Max.   :0   Max.   :0   Max.   :0 

Obviously not what I expect. With the exact method, I do get values that make sense:

shap <- explain(fit , X = X, exact=TRUE, nsim = 10)
summary(shap)

       x1                x2                x3                 x4         
 Min.   :-0.3659   Min.   :-0.8149   Min.   :-0.62699   Min.   :-1.0497  
 1st Qu.:-0.3659   1st Qu.:-0.8149   1st Qu.:-0.62699   1st Qu.:-1.0497  
 Median :-0.3659   Median :-0.8149   Median : 0.06967   Median : 0.6998  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.5489   3rd Qu.: 1.2223   3rd Qu.: 0.59215   3rd Qu.: 0.6998  
 Max.   : 0.8538   Max.   : 1.2223   Max.   : 0.76632   Max.   : 0.6998  

but I cannot use the exact method on my actual data because the model features are not independent. Any help getting this to work would be appreciated!

bgreenwell commented 2 years ago

Hi @esther-meerwijk, I just ran your code and I get the same results…very strange. I’m on vacation but will try to figure out what’s going on later this week.

bgreenwell commented 2 years ago

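Hi @esther-meerwijk, a couple of small tweaks fix your script: the prediction wrapper now passes newdata on to predict() (without it, predict() returns the fitted values for the training data regardless of input, so every estimate comes out as zero), it uses type = "link" so the approximate values are on the same scale as the exact ones, and X is a data frame of the features only rather than a model matrix.

Code and output below: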

x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)

df <- data.frame(x1, x2, x3, x4, y)
X <- subset(df, select = -y)  # features only

fit <- glm(y ~ ., data=df, family=binomial)

pfun <- function(object, newdata) {
  predict(object, type = "link", newdata = newdata)
}

set.seed(845)  # for reproducibility
head(shap1 <- explain(fit , X = X, pred_wrapper = pfun, nsim = 1000))
# A tibble: 6 × 4
#       x1     x2      x3     x4
#    <dbl>  <dbl>   <dbl>  <dbl>
# 1  0.853  1.22  -0.639   0.723
# 2  0.848 -0.807  0.0390 -1.03 
# 3  0.868 -0.854  0.748   0.696
# 4 -0.379  1.24  -0.601   0.682
# 5 -0.392  1.17   0.0620 -1.01 
# 6 -0.381  1.24   0.777   0.693

head(shap2 <- explain(fit , X = X, exact = TRUE))
# A tibble: 6 × 4
#       x1     x2      x3     x4
#    <dbl>  <dbl>   <dbl>  <dbl>
# 1  0.854  1.22  -0.627   0.700
# 2  0.854 -0.815  0.0697 -1.05 
# 3  0.854 -0.815  0.766   0.700
# 4 -0.366  1.22  -0.627   0.700
# 5 -0.366  1.22   0.0697 -1.05 
# 6 -0.366  1.22   0.766   0.700
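
The all-zero result in the original report can be reproduced without calling fastshap at all. The sketch below is a minimal base R illustration, not part of the package: the names `pfun_bad`, `pfun_ok`, and `X_flip` are hypothetical, and flipping `x2` is only a stand-in for the perturbed data that a sampling-based method hands to the prediction wrapper. Because such a method works from differences between predictions on perturbed inputs, a wrapper that ignores `newdata` makes every difference zero.

```r
# Toy data and model from the thread
df <- data.frame(
  x1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  x2 = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0),
  x3 = c(3, 2, 1, 3, 2, 1, 3, 2, 1, 3),
  x4 = c(1, 0, 1, 1, 0, 1, 0, 1, 0, 1),
  y  = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1)
)
fit <- glm(y ~ ., data = df, family = binomial)
X <- subset(df, select = -y)  # features only

# Original wrapper: accepts `newdata` but never passes it to predict(),
# so it always returns the fitted values for the training data
pfun_bad <- function(object, newdata) {
  predict(object, type = "response")
}

# Corrected wrapper: predictions depend on `newdata`
pfun_ok <- function(object, newdata) {
  predict(object, type = "link", newdata = newdata)
}

# Hypothetical perturbation (assumption for illustration): flip the binary x2
X_flip <- X
X_flip$x2 <- 1 - X_flip$x2

diff_bad <- pfun_bad(fit, X_flip) - pfun_bad(fit, X)
diff_ok  <- pfun_ok(fit, X_flip) - pfun_ok(fit, X)

max(abs(diff_bad))     # 0: the wrapper ignores its input
all(abs(diff_ok) > 0)  # TRUE: every prediction moves when x2 is flipped
```

A quick check like this on any new wrapper (change a column, confirm the predictions change) can catch the problem before running a long `explain()` call.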

esther-meerwijk commented 2 years ago

Yep, that does it 👍 Thanks so much for figuring that out!