bgreenwell / pdp

A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.
http://bgreenwell.github.io/pdp

How to get partial dependence for a target-encoded categorical feature #88

Closed · kaoribundo closed this 5 years ago

kaoribundo commented 5 years ago

Issue: the 'pdp' library is very useful, but I have a problem in one case.

In order to fit an xgboost model, I translated categorical features to numeric features using target encoding. Then I called the 'partial' function with the 'cats' argument for that feature. I expected yhat to correspond to each target-encoded value, but the feature values returned by partial were different. (The original values and the partial values did not match.)

Here is example code that reproduces the problem. Please tell me how to solve it.

## Load Required packages
library(dplyr)
library(xgboost)
library(pdp) 

## data (example)
## real data has more features 
> head(data_example)
  objective numeric_feature one_hot_encoding_feature target_encoding_feature
1         1             392                        0            6.077463e-05
2         1             765                        0            2.891865e-03
3         1             643                        0            3.254317e-03
4         0             330                        0            5.517329e-05
5         0             194                        0            1.075839e-05
6         0             194                        0            1.372488e-05

## Modeling
### translate dataframe to xgb.DMatrix
train_data <- xgb.DMatrix(
    data = as.matrix(dplyr::select(data_example, -objective))
    ,label = data_example$objective  ## a label is required for training
)

### Fit Xgboost Model
xgb_model <- xgboost(
    data = train_data
    ,params = list(objective = "binary:logistic")
    ,nrounds = 100  ## nrounds is required; 100 here is just an example value
)

## Get Partial Dependence of each feature
## numeric_feature
## it worked !!
p_numeric <- partial(xgb_model
            ,pred.var = "numeric_feature"
            ,train = as.matrix(dplyr::select(data_example, -objective))
            ,type = "regression"
)
> head(p_numeric)
 numeric_feature       yhat
1           0.00 0.03914417
2          30.38 0.03359368
3          60.76 0.03358342
4          91.14 0.03340583
5         121.52 0.03329919
6         151.90 0.03330897

## one_hot_encoding_feature
## it worked !!
p_onehot <- partial(xgb_model
            ,pred.var = "one_hot_encoding_feature"
            ,train = as.matrix(dplyr::select(data_example, -objective))
            ,type = "regression"
            ,cats = "one_hot_encoding_feature"
)
> p_onehot
 one_hot_encoding_feature       yhat
1                       0 0.03315646
2                       1 0.03301551

## target_encoding_feature
## here I faced the problem!
p_target <- partial(xgb_model
            ,pred.var = "target_encoding_feature"
            ,train = as.matrix(dplyr::select(data_example, -objective))
            ,type = "regression"
            ,cats = "target_encoding_feature"
)
> head(p_target)
  target_encoding_feature       yhat
1            1.075839e-05 0.02349093
2            9.605801e-04 0.05872103
3            1.910402e-03 0.05886231
4            2.860224e-03 0.06028017
5            3.810045e-03 0.06624583
6            4.759867e-03 0.06624583

## compare the original target encoding values with the partial output
## I would like to get yhat corresponding to the original values
> sort(unique(data_example$target_encoding_feature))
 [1] 1.075839e-05 1.294359e-05 1.360556e-05 1.372488e-05 1.468446e-05 1.509768e-05 1.766756e-05 5.517329e-05
 [9] 6.077463e-05 6.478200e-05 6.573776e-05 7.262987e-05 7.770370e-05 7.959780e-05 8.514537e-05 8.717332e-05
[17] 1.070893e-04 1.175257e-04 1.351717e-04 1.626339e-04 1.948620e-04 1.994245e-04 2.062776e-04 2.140787e-04
[25] 2.141661e-04 2.166361e-04 2.241656e-04 2.481869e-04 3.676495e-04 3.923796e-04 4.283972e-04 4.383589e-04
[33] 4.499127e-04 4.738567e-04 5.141846e-04 7.232350e-04 7.588255e-04 7.852622e-04 8.776138e-04 9.964129e-04
[41] 1.066354e-03 1.074656e-03 2.891865e-03 2.905396e-03 3.172273e-03 3.237116e-03 3.254317e-03 3.308820e-03
[49] 3.401120e-03 9.411765e-03 9.624639e-03 1.082056e-02 1.123596e-02 1.181525e-02 1.377727e-02 1.910828e-02
[57] 2.047981e-02 2.286483e-02 2.544910e-02 2.588556e-02 2.633559e-02 2.978723e-02 3.880901e-02 4.027976e-02
[65] 4.114286e-02 4.155194e-02 4.496066e-02 4.574758e-02 4.750185e-02

> p_target$target_encoding_feature
 [1] 1.075839e-05 9.605801e-04 1.910402e-03 2.860224e-03 3.810045e-03 4.759867e-03 5.709689e-03 6.659511e-03
 [9] 7.609332e-03 8.559154e-03 9.508976e-03 1.045880e-02 1.140862e-02 1.235844e-02 1.330826e-02 1.425808e-02
[17] 1.520791e-02 1.615773e-02 1.710755e-02 1.805737e-02 1.900719e-02 1.995702e-02 2.090684e-02 2.185666e-02
[25] 2.280648e-02 2.375630e-02 2.470612e-02 2.565595e-02 2.660577e-02 2.755559e-02 2.850541e-02 2.945523e-02
[33] 3.040505e-02 3.135488e-02 3.230470e-02 3.325452e-02 3.420434e-02 3.515416e-02 3.610398e-02 3.705381e-02
[41] 3.800363e-02 3.895345e-02 3.990327e-02 4.085309e-02 4.180292e-02 4.275274e-02 4.370256e-02 4.465238e-02
[49] 4.560220e-02 4.655202e-02 4.750185e-02
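
A quick check makes the mismatch concrete (a small sketch, not from the original post); typically only the endpoints of the range coincide with observed encoded values:

## What fraction of the grid values are actually observed encoded values?
mean(p_target$target_encoding_feature %in%
         data_example$target_encoding_feature)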

Thank you for reading this issue.

bgreenwell commented 5 years ago

Hi @kaoribundo. Perhaps I'm not fully understanding the issue. Why would you expect the partial dependence (i.e., yhat) to match the encoded feature values?

kaoribundo commented 5 years ago

Hello @bgreenwell. Thank you for your reply, and sorry for my poor explanation.

For example, I used an age_range feature and would like to understand the partial dependence for each category (10's, 20's, and so on). But I translated this categorical feature to a numeric feature using target-based encoding, i.e., the occurrence probability of the objective for each category (like 10's → 0.25, 20's → 0.5), and fit the xgboost model.

In order to interpret the behaviour of the model for each category (like 10's and 20's), I think we need to know the yhat at exactly 0.25 and 0.5.
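
To make the encoding concrete, here is a minimal sketch of target (mean) encoding with dplyr; this is not from the original post, and the categorical column age_range is hypothetical:

## Minimal sketch of target (mean) encoding (hypothetical age_range column)
library(dplyr)

## Map each category to the mean of the binary objective within that
## category (the "occurrence probability" described above)
encoding_map <- data_example %>%
  group_by(age_range) %>%
  summarise(target_encoding_feature = mean(objective))

## Attach the encoded value to each row
data_example <- left_join(data_example, encoding_map, by = "age_range")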

I have little knowledge of interpretable machine learning, so maybe I am asking for something unreasonable ...

kaoribundo commented 5 years ago

Hi @bgreenwell

I solved my problem by using the 'pred.grid' argument, and got yhat values that match the target-encoded categorical feature.

I looked at your code and understand that if the 'pred.grid' argument is missing, pdp constructs the grid values itself. And when the 'cats' argument is used, the unique values of the feature are used in pred.grid. https://github.com/bgreenwell/pdp/blob/a7f755ceffd0d2ab5da684869ec1f40f882d2bab/R/pred_grid.R#L25-L27
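
Paraphrasing that grid logic as a rough sketch (illustrative only, not the actual pdp source): categorical columns contribute their unique values, while numeric columns get an evenly spaced sequence (51 points by default), which matches the 51 evenly spaced values in p_target above.

## Rough paraphrase of the single-predictor grid construction
## (illustrative only, not the actual pdp source)
grid_for <- function(x, grid.resolution = 51) {
  if (is.factor(x)) {
    unique(x)  ## categorical: use the observed values
  } else {
    seq(from = min(x, na.rm = TRUE), to = max(x, na.rm = TRUE),
        length.out = grid.resolution)  ## numeric: evenly spaced sequence
  }
}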

For a numeric column like my target-encoded feature, this produces an evenly spaced sequence rather than the observed values, so I could not get the results I expected from 'cats' alone. But anyway, I got the expected results by using the 'pred.grid' argument, as sketched below!
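
A minimal sketch of that workaround (my reconstruction under the same assumptions as the code above, not the exact call from the original post): supply the observed encoded values directly via pred.grid so yhat is computed at exactly those points.

## Supply the observed encoded values directly as the prediction grid
p_target <- partial(xgb_model
            ,pred.var = "target_encoding_feature"
            ,pred.grid = data.frame(
                target_encoding_feature =
                    sort(unique(data_example$target_encoding_feature)))
            ,train = as.matrix(dplyr::select(data_example, -objective))
            ,type = "regression"
)
head(p_target)  ## yhat now lines up with each observed encoded value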

Thank you for your time, and I will close this issue.