SauceCat / PDPbox

python partial dependence plot toolbox
http://pdpbox.readthedocs.io/en/latest/
MIT License
840 stars, 129 forks

Generating plots with sklearn Pipeline objects #59

Open mekomlusa opened 4 years ago

mekomlusa commented 4 years ago

Thanks for creating such a tool for partial dependence plots in Python. I did find an issue, though. In my current project, the trained model is wrapped in a pipeline. Incoming data has a handful of features, and the categorical ones are one-hot encoded by preprocessors within the pipeline object. PDPbox works fine when I pass the pipeline and a numerical feature that is available in the test dataframe. However, things get interesting when I try to plot a one-hot encoded categorical feature...

  1. Cannot pass the original dataframe and the list of one-hot encoded feature names: the feature names are not found in the dataframe.
  2. Cannot pass the transformed data (obtained by first extracting the preprocessor from the sklearn pipeline and applying it to the data) and the list of one-hot encoded feature names: the preprocessor returns an array, and the package only accepts pandas DataFrames (error message: ValueError: only accept pandas DataFrame)
  3. Cannot pass the original dataframe and the original name of the feature: as the feature is one-hot encoded in the pipeline, plots cannot be generated correctly.

Is there a way to better support sklearn Pipeline objects? Ideally, users should be able to pass a pipeline and one-hot encoded feature names as arguments.

hockshem commented 3 years ago

Hello, I'm facing the same problem. Is there any workaround for this issue?

DimitriMisiak commented 3 years ago

Hi, I just started using PDPbox after following the Kaggle courses, and I stumbled upon the same limitation. My workaround is to not use the pipeline as is, but rather to apply the model to the preprocessed data. I manually create the pandas DataFrame from the preprocessor-transformed data. The not-so-trivial part is recovering the feature names (especially the ones created by the OneHotEncoder). For this, I use an edited version of the function get_column_names_from_ColumnTransformer from this thread: https://github.com/scikit-learn/scikit-learn/issues/12525#issuecomment-640900712 In the end, my code looks like this:

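import pandas as pd
from pdpbox import pdp
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# numerical_cols, categorical_cols, my_model, X_train, y_train and X_valid
# are defined elsewhere (not shown)

# building my preprocessor
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='error', drop='if_binary'))
])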
my_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
])

# fyi, the structure of my pipeline and training the model
my_pipeline = Pipeline(steps=[
    ('preprocessor', my_preprocessor),
    ('model', my_model)
])
my_pipeline.fit(X_train, y_train)

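# preprocessing (my_pipeline.fit above already fitted my_preprocessor in place,
# so only transform is needed here)
X_valid_transformed = my_preprocessor.transform(X_valid)
# the ColumnTransformer may return a scipy sparse matrix; densify it for pandas
if hasattr(X_valid_transformed, 'toarray'):
    X_valid_transformed = X_valid_transformed.toarray()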

# building a valid DataFrame for pdpbox
feature_names = get_column_names_from_ColumnTransformer(my_preprocessor)
X_valid_transformed_pd = pd.DataFrame(X_valid_transformed, columns=feature_names)

# some pdpbox action
pdp_goals = pdp.pdp_isolate(
    model=my_model,
    dataset=X_valid_transformed_pd,
    model_features=feature_names,
    feature='Fare'
)
pdp.pdp_plot(pdp_goals, 'Fare')

I hope this helps!