csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License

Understanding the engineered features in AutoGluon #163

Open · vinay-k12 opened this issue 1 year ago

vinay-k12 commented 1 year ago

I was using interpretable models in AutoGluon. While the model training was easy, the challenge is in understanding the rules: they are built from engineered features, and we have no visibility into the feature engineering. For example, this rule was created when I was running on LendingClub data.

[image: a learned rule referencing the value 11 for 'emp_title']

There is no such value as '11' in 'emp_title' in the raw data. So how do we reverse transform the value '11' back to the original data?
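(For illustration, the integer is presumably a label-encoded category code. A minimal sketch of that kind of encoding, using pandas categorical codes as a stand-in for whatever AutoGluon does internally, not its documented API:)

import pandas as pd

# Hypothetical stand-in for AutoGluon's internal label encoding.
s = pd.Series(['teacher', 'engineer', 'nurse']).astype('category')
print(s.cat.codes.tolist())   # [2, 0, 1] -- integers like the '11' above
print(s.cat.categories[1])    # 'nurse' -- code 1 maps back to the original string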

mglowacki100 commented 1 year ago

I'm not sure if AutoGluon creates those names, but if you're looking for a quick fix, you can one-hot encode the categorical variables yourself:

import pandas as pd

def dummification(df, col):
  # One-hot encode `col` into indicator columns named like 'col_value',
  # then drop the original column.
  dfz = pd.get_dummies(df[col], prefix=col)
  df = df.drop(columns=[col])
  return pd.concat([df, dfz], axis=1)

...
# 'education-num' is just 'education' ordinally encoded, so drop the duplicate.
train_data = train_data.drop(columns='education-num')
categorical = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']

for c in categorical:
  train_data = dummification(train_data, c)
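One caveat with pd.get_dummies: applied separately to train and test data, it can produce mismatched columns when a category is missing from one split. A sketch of the same idea with scikit-learn's OneHotEncoder (assuming sklearn >= 1.2 for the sparse_output argument), which remembers the training categories:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on train so the column set is fixed; unseen test categories encode as all zeros.
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
onehot = pd.DataFrame(enc.fit_transform(train_data[categorical]),
                      columns=enc.get_feature_names_out(categorical),
                      index=train_data.index)
train_data = pd.concat([train_data.drop(columns=categorical), onehot], axis=1)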

With this preprocessing in place, the rules become readable:

predictor.print_interpretable_rules(model_name='RuleFit_3')
 coef  rule
 0.00  capital-gain
-0.47  capital-gain <= 6571.5 and education_ Prof-school <= 0.5
-0.44  capital-gain <= 7073.5 and occupation_ Exec-managerial <= 0.5
-0.19  fnlwgt <= 260314.5 and capital-gain <= 7073.5 and education_ Prof-school <= 0.5
-0.36  capital-gain <= 7268.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5
-0.85  capital-gain <= 6571.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5 and workclass_ Self-emp-inc <= 0.5
-0.14  age <= 42.5 and capital-gain <= 7073.5
-0.38  age <= 38.5 and education_ Masters <= 0.5
-0.37  capital-gain <= 7073.5 and marital-status_ Married-civ-spouse <= 0.5 and workclass_ Self-emp-inc <= 0.5
 0.86  age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 38.5
 0.47  age <= 62.5 and age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and race_ White > 0.5
 0.03  age > 29.5 and education_ HS-grad <= 0.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 33.5
 0.07  age > 33.5 and education_ 11th <= 0.5 and capital-gain <= 4782.0 and marital-status_ Married-civ-spouse > 0.5 and occupation_ Farming-fishing <= 0.5 and hours-per-week > 37.5
 0.25  age > 42.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 28.5
 0.64  age <= 52.0 and age > 27.5 and fnlwgt > 134350.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 32.5 and occupation_ Machine-op-inspct <= 0.5
 0.17  fnlwgt > 104201.0 and capital-gain <= 7268.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 35.5 and workclass_ ? <= 0.5 and workclass_ Private <= 0.5

Here, for a one-hot-encoded categorical feature, > 0.5 means True (the category is present) and <= 0.5 means False.
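To make the coefficients concrete, here is a minimal sketch (not the imodels/AutoGluon API) of how a RuleFit-style score is assembled: each rule is a binary indicator, and the coefficients of the rules that fire are summed (linear terms, such as the bare capital-gain row above, are omitted here):

# Two rules lifted from the table above, written as predicates.
rules = [
    (lambda r: r['age'] > 27.5 and r['marital-status_ Married-civ-spouse'] > 0.5
               and r['hours-per-week'] > 38.5, 0.86),
    (lambda r: r['age'] <= 42.5 and r['capital-gain'] <= 7073.5, -0.14),
]
row = {'age': 45, 'capital-gain': 0, 'hours-per-week': 40,
       'marital-status_ Married-civ-spouse': 1}
score = sum(coef for rule, coef in rules if rule(row))
print(score)  # 0.86 -- only the first rule fires for this row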

csinva commented 1 year ago

Thanks @mglowacki100! I agree, I think one-hot encoding is the best way to go for now.

That feature engineering is performed by AutoGluon, not imodels. There isn't currently support for inverse transforming back to the original features, but we will try to add it soon!
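In the meantime, a possible workaround is to record the category-to-code mapping yourself before handing the data to AutoGluon, so a learned rule can be decoded by hand. A sketch under that assumption (the mapping below comes from your own preprocessing, not from AutoGluon; a pre-encoded integer column should be treated as numeric, so rule thresholds then refer to your codes):

import pandas as pd

# Encode the column yourself and keep the code -> label lookup table.
col = train_data['emp_title'].astype('category')
mapping = dict(enumerate(col.cat.categories))   # {0: '...', 1: '...', ...}
train_data['emp_title'] = col.cat.codes

# A rule like "emp_title <= 11" can then be decoded into the categories it covers:
covered = [label for code, label in mapping.items() if code <= 11]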

vinay-k12 commented 1 year ago

I thought of that, but figured it would hugely increase training time. Anyway, I'll run it on a limited set of features.

mglowacki100 commented 1 year ago

Hi @csinva, I see you're an AutoGluon contributor, so two additional things regarding interpretable:

Innixma commented 1 year ago

This is a tricky situation: I don't think it is possible for the categorical feature rules to display meaningful information in a low-split-count model without one-hot encoding them, since we use label encoding, where a tree-model split is nearly impossible to interpret. However, you probably pay a large performance and accuracy penalty for one-hot encoding.

@csinva in https://github.com/autogluon/autogluon/pull/2981 I am moving the interpretable logic into its own class called InterpretableTabularPredictor, where I disable models such as the weighted ensemble and post-hoc calibration that would corrupt the interpretable aspects of the models. One option would be to implement a custom feature generator that includes a one-hot-encoding stage for all categoricals. I'm unsure whether this would be a satisfying solution, so I'd like to hear your thoughts.
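Until such a generator exists, a stopgap that sidesteps AutoGluon's internal label encoding entirely is to one-hot encode before calling fit. A sketch under assumptions: the 'class' label column follows the adult-dataset example above, and the 'interpretable' preset name may differ in your AutoGluon version:

from autogluon.tabular import TabularPredictor

# train_data has already been one-hot encoded (see dummification above), so
# AutoGluon only sees numeric 0/1 columns and never label-encodes anything.
predictor = TabularPredictor(label='class').fit(train_data, presets='interpretable')
predictor.print_interpretable_rules(model_name='RuleFit_3')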

csinva commented 1 year ago

Thanks, I think one-hot encoding categorical variables is a decent solution, as it should at least preserve interpretability.