ModelOriented / ingredients

Effects and Importances of Model Ingredients
https://modeloriented.github.io/ingredients/
GNU General Public License v3.0

make pdp and cp work with NA in data #120

Closed hbaniecki closed 3 years ago

hbaniecki commented 3 years ago

crossref https://github.com/ModelOriented/modelStudio/issues/71

pbiecek commented 3 years ago

I could not find a reproducible example. @hbaniecki, would you check whether this is solved?

I've checked this with

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                        data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                              data = titanic_imputed[,-8],
                              y = titanic_imputed[,8],
                              verbose = FALSE)
pdp_glm <- partial_dependence(explain_titanic_glm,
                             N = 25, variables = c("age", "fare","sibsp"),
                             variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

hbaniecki commented 3 years ago

I guess that it works after the fix:

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- partial_dependence(explain_titanic_glm,
                              N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

pbiecek commented 3 years ago

thanks

p-schaefer commented 1 year ago

Hi there,

I'm wondering if there is some way of making conditional and accumulated dependence plots work with NAs, i.e.:

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))
titanic_imputed[2:1000,2] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- conditional_dependence(explain_titanic_glm,
                              N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

Thanks

hbaniecki commented 1 year ago

Hi, what is your goal? PD/ALE rely on estimating expected predictions with respect to the data distribution.

Did you consider removing observations without age (with NAs) from the data to estimate the explanation of age?
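
To illustrate the point about expected predictions, here is a minimal base-R sketch of a partial-dependence estimate computed by hand. It is not the ingredients implementation: the toy data frame `toy`, the `lm` model, and the helper `pd_by_hand` are hypothetical names introduced only for this illustration. With this model, rows with a missing `age` produce NA predictions, so the averaged profile for `fare` becomes NA, while dropping those rows first gives a finite estimate.

```r
# Toy data; all names here are illustrative and not taken from the thread.
set.seed(1)
toy <- data.frame(age = runif(200, 1, 80), fare = runif(200, 0, 100))
toy$y <- 0.5 - 0.004 * toy$age + 0.003 * toy$fare + rnorm(200, sd = 0.05)
model <- lm(y ~ age + fare, data = toy)

# Hand-rolled partial dependence: fix `variable` at each grid value,
# predict for every row, and average the predictions.
pd_by_hand <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    data[[variable]] <- value
    mean(predict(model, newdata = data))
  })
}

# Introduce missing ages, similar to the examples above.
toy_na <- toy
toy_na$age[2:100] <- NA

grid <- c(10, 50, 90)

# Rows with NA in `age` give NA predictions, so each mean is NA.
pd_with_na <- pd_by_hand(model, toy_na, "fare", grid)

# Dropping the rows without `age` first gives a finite profile.
pd_dropped <- pd_by_hand(model, toy_na[!is.na(toy_na$age), ], "fare", grid)
```

Whether a given model returns NA for incomplete rows depends on its predict method, so this sketch only shows the mechanism for `lm`.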

p-schaefer commented 1 year ago

Sorry, this was a bad example. I was piggybacking on the example from this thread. In doing more testing with reasonable numbers of NAs, I see that conditional_dependence() does work with NAs:

library("DALEX")
library("ingredients")
library("randomForest")

model_titanic_glm <- randomForest(survived ~ gender + age + fare,
                                  data = na.omit(titanic_imputed))

toNA<-sample(1:1000,10)

titanic_imputed[toNA,] = NA
explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_imputed[,-8],
                               y = titanic_imputed[,8],
                               verbose = FALSE)
pdp_glm <- conditional_dependence(explain_titanic_glm,
                                  N = 25, variables = c("age", "fare","sibsp"))
#, variable_splits = list(age = seq(0,100,0.1), fare = c(0:100), sibsp=0:10))
plot(pdp_glm)

Unfortunately, in my significantly larger and more complicated models, I'm running into issues related to missing values where the aggregated profiles aren't being calculated. When I impute the missing values, there are no issues, but I can't seem to recreate the problem with a simpler dataset/model. Do you know of any situations where aggregating profiles fails because of NAs? In my data, no column is entirely NA, unlike in my previous examples.
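
One way to narrow this kind of problem down, offered as a sketch and not as a diagnosis, is to check where the missing values sit in the explainer data and whether the model returns NA predictions for those rows. An average over predictions that include NA is itself NA unless the NAs are dropped. The helper `na_report` and the toy objects below are hypothetical and use base R only.

```r
# Hypothetical helper: summarise missingness in the data and in the predictions.
na_report <- function(model, data, predict_function = predict) {
  preds <- predict_function(model, data)
  list(
    na_per_column  = colSums(is.na(data)),     # NA count for each variable
    rows_with_na   = sum(!complete.cases(data)), # rows with at least one NA
    na_predictions = sum(is.na(preds))         # predictions the model could not compute
  )
}

# Toy model and data, for illustration only.
set.seed(2)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- toy$x1 + 2 * toy$x2 + rnorm(100, sd = 0.1)
model <- lm(y ~ x1 + x2, data = toy)

# A few scattered NAs in one predictor, no column entirely missing.
toy_na <- toy[, c("x1", "x2")]
toy_na$x2[c(3, 17, 42)] <- NA

report <- na_report(model, toy_na)
```

If `na_predictions` is non-zero, the model cannot score some rows, which is one plausible place for aggregation to break. With a DALEX explainer, the same check could presumably be run as `na_report(explainer$model, explainer$data, explainer$predict_function)`; that call is an assumption about how the helper would be used and was not tested in this thread.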