ModelOriented / DALEX

moDel Agnostic Language for Exploration and eXplanation
https://dalex.drwhy.ai
GNU General Public License v3.0

Error in `predict_surrogate` when `new_observation` has a target value #487

Closed Nehagupta90 closed 2 years ago

Nehagupta90 commented 2 years ago

I have used the Break-down method for instance-level explanation and it works fine. I have never used the LIME method before, and now that I am using it, it gives me the following error:

Error in [.data.frame(explainer$data, , colnames(new_observation)) : undefined columns selected

My code is:

explainer5 = explain_mlr3(model5, data = test[,-21], y = as.numeric(test$report)-1, label="SVM")

new_observation = test[6, ]
plot(predict_parts(explainer5, new_observation = new_observation, type = "break_down_interactions")) # This works fine

# The problem is in the following code

model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer

lime_tool <- predict_surrogate(explainer = explainer5, new_observation = new_observation, n_features = 3, n_permutations = 1000, type = "lime")

Error in [.data.frame(explainer$data, , colnames(new_observation)) : undefined columns selected

What could be the problem? I am following the example in https://ema.drwhy.ai/LIME.html

hbaniecki commented 2 years ago

Hi, can you provide a reproducible example, from loading the data through creating the model to obtaining the error? E.g. you could train your mlr3 model on the example you mentioned:

titanic_imputed <- archivist::aread("pbiecek/models/27e5c")
titanic_rf <- # mlr3 model based on titanic_imputed
johnny_d <- archivist:: aread("pbiecek/models/e3596")

library("randomForest")
library("DALEX")
titanic_rf_exp <- DALEX::explain(model = titanic_rf,  
                                 data = titanic_imputed[, -9],
                                 y = titanic_imputed$survived == "yes", 
                                 label = "Random Forest")

set.seed(1)
library("DALEXtra")
library("lime")
model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer

lime_johnny <- predict_surrogate(explainer = titanic_rf_exp, 
                                 new_observation = johnny_d, 
                                 n_features = 3, 
                                 n_permutations = 1000,
                                 type = "lime")

(as.data.frame(lime_johnny))

plot(lime_johnny)

Nehagupta90 commented 2 years ago

I am using the following code, which works with the Break-down method:

data = readARFF("xalan.arff")
index = sample(1:nrow(data), 0.7*nrow(data))
train = data[index,]
test = data[-index,]
task = TaskRegr$new("data", backend = train, target = "bug")

print(task)

learner=lrn("regr.ksvm")

model= learner$train(task )

explainer2 = explain_mlr3(model, data = test[,-21], y = as.numeric(test$bug)-1, label="SVM")

new_observation= test[36,]

The following works with Breakdown

plot(predict_parts(explainer2, new_observation = new_observation, type = "break_down_interactions"))

The following doesn't work

model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer

lime_johnny <- predict_surrogate(explainer = explainer2, new_observation = new_observation, n_features = 3, n_permutations = 1000, type = "lime")

Libraries I used are the following

library(farff)
library(mlr3learners)
library(mlr3extralearners)
library(mlr3)
library(DALEX)
library(DALEXtra)
library(lime)
library(ingredients)
library(ceterisParibus)


hbaniecki commented 2 years ago

Hi, this code works for me, so unless you share data or provide a reproducible example, I might not be able to help you.

library(mlr3learners)
library(mlr3extralearners)
library(mlr3)
library(DALEX)
library(DALEXtra)

library(lime)

index= sample(1:nrow(titanic_imputed), 0.7*nrow(titanic_imputed))
train= titanic_imputed[index,]
test= titanic_imputed[-index,]
task = TaskRegr$new("data", backend = train, target = "survived")

print(task)

learner=lrn("regr.ksvm")

model= learner$train(task )

explainer2 = explain_mlr3(model,
                          data = test[,-21],
                          y = as.numeric(test$survived),
                          label="SVM")

new_observation= test[36,]

### The following works with Breakdown

plot(predict_parts(explainer2,
                   new_observation = new_observation,
                   type = "break_down_interactions"))

## The following WORKS

model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer

lime_johnny <- predict_surrogate(explainer = explainer2,
                                 new_observation = new_observation,
                                 n_features = 3,
                                 n_permutations = 1000,
                                 type = "lime")
plot(lime_johnny)

You can also try to update all the used libraries. My session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lime_0.5.2               DALEXtra_2.1.1           DALEX_2.4.0              mlr3extralearners_0.5.18
[5] mlr3learners_0.5.1       mlr3_0.13.0             

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7           paradox_0.7.1        lubridate_1.8.0      lattice_0.20-44     
 [5] listenv_0.8.0        png_0.1-7            palmerpenguins_0.1.0 assertthat_0.2.1    
 [9] glmnet_4.1-3         digest_0.6.29        foreach_1.5.1        utf8_1.2.2          
[13] parallelly_1.28.1    R6_2.5.1             backports_1.4.1      RSQLite_2.2.9       
[17] httr_1.4.2           ggplot2_3.3.5        pillar_1.7.0         flock_0.7           
[21] rlang_1.0.1          uuid_0.1-4           rstudioapi_0.13      data.table_1.14.2   
[25] kernlab_0.9-29       blob_1.2.2           Matrix_1.3-4         checkmate_2.0.0     
[29] reticulate_1.22      labeling_0.4.2       splines_4.1.1        gower_0.2.2         
[33] RCurl_1.98-1.5       bit_4.0.4            munsell_0.5.0        compiler_4.1.1      
[37] pkgconfig_2.0.3      shape_1.4.6          globals_0.14.0       tidyselect_1.1.1    
[41] tibble_3.1.6         lgr_0.4.3            mlr3misc_0.9.5       codetools_0.2-18    
[45] fansi_1.0.2          future_1.23.0        crayon_1.5.0         dplyr_1.0.7         
[49] bitops_1.0-7         rappdirs_0.3.3       grid_4.1.1           jsonlite_1.8.0      
[53] gtable_0.3.0         lifecycle_1.0.1      DBI_1.1.2            magrittr_2.0.2      
[57] scales_1.1.1         archivist_2.3.6      stringi_1.7.6        cli_3.2.0           
[61] cachem_1.0.6         farver_2.1.0         iBreakDown_2.0.1     ellipsis_0.3.2      
[65] generics_0.1.1       vctrs_0.3.8          iterators_1.0.13     tools_4.1.1         
[69] bit64_4.0.5          glue_1.6.1           purrr_0.3.4          survival_3.2-13     
[73] parallel_4.1.1       fastmap_1.1.0        colorspace_2.0-3     memoise_2.0.0

Nehagupta90 commented 2 years ago

Another thing: how does the above code work for you, given that you used the titanic dataset? The output variable of the titanic dataset is a classification target, while I used a regression learner. Doesn't it matter?


Nehagupta90 commented 2 years ago

xalan.csv

I attached the dataset

hbaniecki commented 2 years ago

"Doesn't it matter?"

You can do a regression into a 0-1 variable; it is just a technicality.

Ok, I ran your example and the error can be fixed by using:

new_observation = test[36, -21] # target variable breaks the code

The target variable in new_observation breaks the following line in predict_surrogate():

https://github.com/ModelOriented/DALEXtra/blob/bf480f8e315455c06c14cd86dc9d47cfb5dfa383/R/predict_surrogate.R#L74-L75
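The error message quoted at the top of the thread suggests that the explainer's data is subset by `colnames(new_observation)`. A minimal base-R sketch of that failure mode follows; the toy data frame and its column names (`x1`, `x2`, `bug`) are made up for illustration, and this is not the DALEXtra source itself. It also shows dropping the target by name, which may be less fragile than a positional index such as `-21`.

```r
# Toy data: two features plus a target column (all names are hypothetical).
test <- data.frame(x1 = c(1, 2, 3), x2 = c(4, 5, 6), bug = c(0, 1, 0))

# The explainer was built without the target, like data = test[, -21] in the thread.
explainer_data <- test[, names(test) != "bug"]

# An observation that still carries the target column.
with_target <- test[2, ]

# Selecting the explainer data by the observation's column names fails,
# because "bug" is not a column of explainer_data.
res <- tryCatch(
  explainer_data[, colnames(with_target)],
  error = function(e) conditionMessage(e)
)
print(res)

# Dropping the target by name leaves only columns the explainer knows about.
new_observation <- with_target[, colnames(with_target) != "bug", drop = FALSE]
print(explainer_data[, colnames(new_observation)])
```

With the real data, the equivalent step would be something like `test[36, colnames(test) != "bug"]`, assuming the target column is named `bug` as in the task definition above.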

Nehagupta90 commented 2 years ago

Thanks a lot, Hubert.

It now works. There is just a warning message:

Warning message: noc does not contain enough variance to use quantile binning. Using standard binning instead.
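One plausible reading of this warning is that the quantile breakpoints for `noc` are not all distinct, so quantile-based bins would coincide. The base-R sketch below illustrates that situation with made-up values; the number of bins and the equal-width fallback are assumptions for illustration, not a copy of what the lime package does internally.

```r
# Hypothetical low-variance feature: almost every value is 0.
noc <- c(rep(0, 95), 1, 1, 2, 3, 5)

# Quantile breakpoints for 4 bins (the bin count is an assumption).
breaks <- quantile(noc, probs = seq(0, 1, length.out = 5))
print(breaks)

# Repeated breakpoints mean the quantile bins are not distinct.
has_duplicates <- any(duplicated(breaks))
print(has_duplicates)

# Equal-width bins over the observed range are one possible fallback.
fallback <- seq(min(noc), max(noc), length.out = 5)
print(fallback)
```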


Nehagupta90 commented 2 years ago

Hello Hubert

When we use the Break-down method, we get something like the following figure. The values next to the input metrics are the values of this particular instance. However, what do the values next to the metrics produced by the LIME method represent? They are shown with < and > signs. What does that mean?

[image: image.png]


hbaniecki commented 2 years ago

Hi, quoting the EMA book (https://ema.drwhy.ai/LIME.html, Section 9.4 Example: Titanic data):

In this example, however, we have got a relatively small number of variables, so we will use a simpler data representation in the form of a binary vector. Toward this aim, each variable is dichotomized into two levels. For example, age is transformed into a binary variable with categories “≤15.36” and “>15.36”, class is transformed into a binary variable with categories “1st/2nd/deck crew” and “other”, and so on.
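The dichotomization described in the quote can be sketched in base R as follows. The ages are made up; only the 15.36 threshold comes from the quoted passage. This shows where labels with `<=` and `>` signs come from, and is not meant as the lime package's own binning code.

```r
# Hypothetical ages; the 15.36 threshold is the one quoted from the EMA book.
age <- c(4, 15.36, 22, 47, 8)
threshold <- 15.36

# Dichotomize the continuous variable into two interpretable levels.
age_bin <- ifelse(age <= threshold, "age <= 15.36", "age > 15.36")
print(data.frame(age = age, age_bin = age_bin))
```

In a LIME plot, the label next to a variable is therefore the interval that the explained observation falls into, rather than its exact value.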

Hope this helps.

Nehagupta90 commented 2 years ago

I will read the details in the book to clarify things further. Thank you again for your support and helpful information.


hbaniecki commented 2 years ago

fixed https://github.com/ModelOriented/DALEXtra/issues/73

Nehagupta90 commented 2 years ago

How can I share my data? Can I attach the dataset?


hbaniecki commented 2 years ago

What do you mean? This issue is fixed.