ModelOriented / DALEXtra

Extensions for the DALEX package
https://ModelOriented.github.io/DALEXtra/

Interpreting model_parts() plot from DALEX or DALEXtra package #80

Closed tengku-hanis closed 2 years ago

tengku-hanis commented 2 years ago

Hi, sorry if this is not the appropriate place to ask.

I tried to build a classification model and explain it using the DALEX package. Below is a reprex of what I'm trying to do.

# Packages 
library(tidymodels)
library(mlbench)

# Data 
data("PimaIndiansDiabetes")
dat <- PimaIndiansDiabetes 
dat$some_new_group[1:384] <- "group 1" 
dat$some_new_group[385:768] <- "group 2"

# Split
set.seed(123)
ind <- initial_split(dat)
dat_train <- training(ind)
dat_test <- testing(ind)

# CV
set.seed(123)
dat_cv <- vfold_cv(dat_train, v = 10)

# Recipes
svm_rec <- 
  recipe(diabetes ~., data = dat_train) %>% 
  update_role(some_new_group, new_role = "group_var") %>% 
  step_rm(pressure) %>% 
  step_YeoJohnson(all_numeric_predictors())

# Model spec 
svm_spec <- 
  svm_rbf() %>% 
  set_mode("classification") %>% 
  set_engine("kernlab")

# Workflow 
svm_wf <- 
  workflow() %>% 
  add_recipe(svm_rec) %>% 
  add_model(svm_spec)

# Train
svm_trained <- 
  svm_wf %>% 
  fit(dat_train)

Notice that in the recipe above, I removed the variable pressure and created a new categorical variable (some_new_group).

Next, I try to explain this model using DALEX.

# Explainer
library(DALEXtra)

svm_exp <- explain_tidymodels(svm_trained, 
                              data = dat %>% select(-diabetes), 
                              y = dat$diabetes %>% as.numeric(), 
                              label = "SVM")

## Variable importance
set.seed(123)
svm_vp <- model_parts(svm_exp, type = "variable_importance") 
svm_vp

The result of svm_vp:

         variable mean_dropout_loss label
1    _full_model_         0.6762916   SVM
2         glucose         0.5827101   SVM
3             age         0.6584117   SVM
4            mass         0.6599677   SVM
5        pregnant         0.6609174   SVM
6        pedigree         0.6620800   SVM
7         insulin         0.6686974   SVM
8         triceps         0.6691379   SVM
9        pressure         0.6762916   SVM
10 some_new_group         0.6762916   SVM
11     _baseline_         0.5017774   SVM    

Here is the plot.

plot(svm_vp) +
  ggtitle("Mean-variable importance over 50 permutations", "") 

[Plot: mean variable-importance over permutations for the SVM model]

So, based on the plot, the most influential variable should be glucose, right? It does not make sense for the some_new_group and pressure variables to be the most important ones, as these variables were not used in the model fitting. I have seen this post and this post, and my plot looks a bit different: my most important variable is at the bottom, while in both posts the most important variable is at the top. Even the direction of my bars is different. I attached one of the plots from the posts as a comparison.

[Comparison plot from the linked post]

Did I miss something in the R code, or skip a certain step?

maksymiuks commented 2 years ago

Hi,

Thank you for opening this issue; I hope I can help.

First of all, let's understand which measure is shown in the plot. It is the 1-AUC loss function. It is used instead of the standard AUC because, as a loss function, lower values should indicate better model performance.
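
To illustrate, here is a toy sketch of that loss using DALEX's loss_one_minus_auc() (the observed/predicted values below are made up for the example):

# 1-AUC loss: lower values mean a better ranking of positives over negatives
library(DALEX)
observed  <- c(0, 0, 1, 1)             # true classes
predicted <- c(0.1, 0.4, 0.35, 0.8)    # predicted probabilities of class 1
loss_one_minus_auc(observed, predicted)  # AUC = 0.75, so the loss is 0.25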

That being said, looking at the first plot, we see that the baseline 1-AUC for this model is roughly 0.675, hence the AUC is 0.325. Because AUC is symmetric around 0.5, I suspect you've inverted the labels, and that is why you get such extraordinary results: according to the plot, permuting glucose made the model significantly better. That does not make sense in general, but in this particular case it is understandable given the nature of the 1-AUC loss. Switching the labels (or using the predict_function_target_column parameter) should solve the issue.
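
For example, a minimal sketch of rebuilding your explainer with the target column pinned explicitly ("pos" is the event level of diabetes in PimaIndiansDiabetes; the 0/1 re-encoding of y is my assumption about the intended setup):

# Rebuild the explainer so predictions refer to the "pos" class
svm_exp2 <- explain_tidymodels(
  svm_trained,
  data = dat %>% select(-diabetes),
  y = as.numeric(dat$diabetes == "pos"),    # 0/1 encoding of the event
  predict_function_target_column = "pos",   # probability column to explain
  label = "SVM"
)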

In general, small fluctuations, and even bars extending to the left, are possible with properly set labels (like the male_single variable in the second plot). It usually means that the variable is extremely insignificant for the trained model, as removing its information increases performance.
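
As a follow-up sketch, those fluctuations can be averaged out by increasing the number of permutation rounds via the B argument of model_parts() (it defaults to 10; B = 50 would match the "50 permutations" title in your reprex). Here svm_exp2 refers to the corrected explainer sketched above:

# Re-run permutation importance with more rounds for smoother estimates
set.seed(123)
svm_vp2 <- model_parts(
  svm_exp2,
  type = "variable_importance",
  loss_function = loss_one_minus_auc,  # explicit 1-AUC loss
  B = 50                               # permutation rounds (default is 10)
)
plot(svm_vp2)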

I hope this clears up some of your concerns.

tengku-hanis commented 2 years ago

Thanks, using the predict_function_target_column parameter cleared up my confusion.