ModelOriented / DALEXtra

Extensions for the DALEX package
https://ModelOriented.github.io/DALEXtra/

How to use predict_function_target_column in explain_xgboost function? #63

Closed asheetal closed 3 years ago

asheetal commented 3 years ago

I have an xgboost model with 7 class labels. The guide says: predict_function_target_column: Character or numeric containing either column name or column number in the model prediction object of the class that should be considered as positive.

What is the range of values I can use for predict_function_target_column? I was hoping the function would throw an error if I set predict_function_target_column = 999. It did not, so I am not sure how to use this option.

maksymiuks commented 3 years ago

Hi,

for what type of model did you use that parameter? Can you provide some code, please?

asheetal commented 3 years ago

Thanks for replying so quickly. The model is a multiclass xgboost model, and I am using it in an explain_xgboost function call. I can create an iris example if you need it. Essentially I just need to know how to use the parameter predict_function_target_column. Is it 0-based or 1-based? Is it working? I put in a random number and the function did not throw an error.

maksymiuks commented 3 years ago

Hi,

I evaluated some examples, here they are:

library(DALEXtra)
library(mlr)      # for createDummyFeatures()
library(xgboost)

# HR ships with the DALEX package; column 6 is the target `status`
data <- as.matrix(createDummyFeatures(HR[, -6]))
model <- xgboost(data, as.numeric(HR$status) - 1, nrounds = 10,
                 params = list(objective = "multi:softprob"),
                 prediction = TRUE, num_class = 3)

explain_xgboost(model, data, as.numeric(HR$status) - 1)

Here is the console output

> explain_xgboost(model, data, as.numeric(HR$status)-1)
Preparation of a new explainer is initiated
  -> model label       :  xgb.Booster  (  default  )
  -> data              :  7847  rows  6  cols 
  -> target variable   :  7847  values 
  -> predict function  :  yhat.xgb.Booster  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package xgboost , ver. 1.1.1.1 , task multiclass (  default  ) 
  -> model_info        :  Model info detected multiclass task but 'y' is a numeric .  (  WARNING  )
  -> model_info        :  By deafult classification tasks supports only factor 'y' parameter. 
  -> model_info        :  Consider changing to a factor vector with true class names.
  -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
  -> predicted values  :  predict function returns multiple columns:  3  (  default  ) 
  -> residual function :  difference between 1 and probability of true class (  default  )
  -> residuals         :  numerical, min =  0.0417468 , mean =  0.3879627 , max =  0.9757479  
  A new explainer has been created!  
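The default multiclass residual reported above (difference between 1 and the probability of the true class) can be illustrated with a small made-up probability matrix; the numbers here are invented for illustration only:

```r
# Made-up 2-row, 3-class probability matrix (not real model output)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
true_class <- c(1, 3)  # 1-based index of the true class for each row

# Residual = 1 - probability assigned to the true class
residuals <- 1 - probs[cbind(seq_len(nrow(probs)), true_class)]
residuals  # 0.3 0.4
```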

Now let us set the parameter:

explain_xgboost(model, data, as.numeric(HR$status)-1, predict_function_target_column = 2)

As expected, the mode was switched from multiclass to binary, with the second column treated as the positive one:

Preparation of a new explainer is initiated
  -> model label       :  xgb.Booster  (  default  )
  -> data              :  7847  rows  6  cols 
  -> target variable   :  7847  values 
  -> predict function  :  yhat.xgb.Booster  will be used (  default  )
  -> predicted values  :  Predict function column set to:  2 (  OK  )
  -> model_info        :  package xgboost , ver. 1.1.1.1 , task classification (  default  ) 
  -> predicted values  :  numerical, min =  0.02040156 , mean =  0.2808482 , max =  0.8666626  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -0.75708 , mean =  0.7084471 , max =  1.979598  
  A new explainer has been created! 

Finally, let us choose a column index that is out of bounds:

explain_xgboost(model, data, as.numeric(HR$status)-1, predict_function_target_column = 4)

and the output indeed shows that there was an error in the predict function:

Preparation of a new explainer is initiated
  -> model label       :  xgb.Booster  (  default  )
  -> data              :  7847  rows  6  cols 
  -> target variable   :  7847  values 
  -> predict function  :  yhat.xgb.Booster  will be used (  default  )
  -> predicted values  :  Predict function column set to:  4 (  OK  )
  -> model_info        :  package xgboost , ver. 1.1.1.1 , task classification (  default  ) 
  -> predicted values  :  the predict_function returns an error when executed (  WARNING  ) 
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  the residual_function returns an error when executed (  WARNING  ) 
    A new explainer has been created!  

So in theory everything should work fine. Can you please share your session info? Also, please keep in mind that the explain function will almost never throw an error: it was designed to extract whatever is possible, even if the call is faulty in places. Perhaps that misled you; the explainer will be created even if you set absurd parameters.
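Since the explainer will not stop on a bad value, one option is to guard the index yourself before building the explainer. A minimal sketch (check_target_column is a hypothetical helper, not part of DALEXtra); column indices are 1-based, as in the examples above:

```r
# Hypothetical guard: stop early if the 1-based column index is out of range
check_target_column <- function(column, n_columns) {
  stopifnot(is.numeric(column), length(column) == 1,
            column >= 1, column <= n_columns)
  column
}

check_target_column(2, 3)    # returns 2: valid for a 3-class model
# check_target_column(4, 3)  # stops with an error before the explainer is built
```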

asheetal commented 3 years ago

Thanks a lot for spending the time to create a scenario to help me. I see the same behavior as yours at my end with your code (i.e. no loss with a bad column number). However, in the research setup, I see the same numerical loss value regardless of the column value. The only difference I can think of is that our research model is using GPU-based predictions. Let me debug this a little more. This is how the function is called in my code.

explainer_xgb <- explain_xgboost(model = fit_all,
                                 data = as.matrix(df.X),
                                 y = df.Y - 1,
                                 weights = df.weights,
                                 true_labels = df.true_label,
                                 verbose = TRUE,
                                 predict_function_target_column = 99999, # this is an illegal column value
                                 label = "xgboost model")
maksymiuks commented 3 years ago

If the data is non-confidential, you can try sending it to me along with your call via email so we can debug it together. Otherwise, please call debug(DALEXtra:::yhat.xgb.Booster) and try to find out what happens (which lines are evaluated and what the output is).

asheetal commented 3 years ago

The feature became redundant in the research project, so we took a different approach to get explanations and column selection was no longer needed. Closing this topic. Thank you for your help.