dsgibbons / shap

A game theoretic approach to explain the output of any machine learning model.
https://shap-community.readthedocs.io/en/latest/
MIT License

LightGBM `TreeExplainer.__call__()` does not work with pandas DataFrame with Categoricals #66

Closed by thatlittleboy 1 year ago

thatlittleboy commented 1 year ago

Related issue slundberg#2144.

LightGBM errors out if we call explainer(X), but not if we call explainer.shap_values(X), even though explainer(X) itself calls explainer.shap_values(X) internally.

Reproducible example

import shap, pandas as pd, lightgbm

X, y = shap.datasets.adult(n_points=500)
X["categ"] = pd.Categorical(
    [p for p in ("M", "F") for _ in range(250)],
    ordered=False,
)
model = lightgbm.LGBMClassifier(n_estimators=10, n_jobs=1)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
explanation = explainer(X)  # <------ errors

Error message:

ValueError: could not convert string to float: 'M'

Expected result: no error is thrown.

thatlittleboy commented 1 year ago

The reason is that __call__() converts the dataframe to a numpy array too eagerly (see L213), before passing it to the shap_values method in L218.

https://github.com/dsgibbons/shap/blob/5445ad3b255157bbc5d7d4d10e119c46ebb78676/shap/explainers/_tree.py#L207-L223

When lightgbm receives a numpy array, it just assumes it is an array of floats.
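The effect of the eager conversion can be reproduced without shap or lightgbm. The snippet below is a minimal sketch on a made-up four-row frame (the column names and values are illustrative, not taken from the adult dataset): calling `.values` on a frame with a categorical column yields an object array of raw strings, and any later cast to float fails with the same kind of error as reported above.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one numeric column, one unordered categorical column.
X = pd.DataFrame({
    "age": [25.0, 40.0, 31.0, 58.0],
    "categ": pd.Categorical(["M", "M", "F", "F"], ordered=False),
})

# Eager conversion: the mixed dtypes collapse into a single object array,
# and the categorical column is flattened to its raw string labels.
arr = X.values
print(arr.dtype)

# Anything downstream that assumes an array of floats now fails.
error_message = None
try:
    arr.astype(np.float64)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```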

lightgbm actually has a dedicated function, _data_from_pandas, that prepares an input pandas dataframe for the LightGBM model. It is only called when lightgbm receives a pandas DataFrame.

In particular, categoricals are carefully encoded in this lightgbm function, a step we skip entirely when we call X.values directly in L213.
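To illustrate what "encoding categoricals" means here, the sketch below replaces each categorical column with its pandas integer codes before converting to a float array. This is not lightgbm's _data_from_pandas, only a rough approximation of the idea; `encode_categoricals` is a hypothetical helper, and the data is made up. A real implementation would also have to keep the category order consistent with the one seen at training time and decide how to handle missing values (which pandas encodes as -1).

```python
import numpy as np
import pandas as pd


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: swap each categorical column for its integer codes."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.codes
    return out


# Illustrative frame: one numeric column, one unordered categorical column.
X = pd.DataFrame({
    "age": [25.0, 40.0, 31.0, 58.0],
    "categ": pd.Categorical(["M", "M", "F", "F"], ordered=False),
})

# After encoding, the frame converts cleanly to a float array.
encoded = encode_categoricals(X).to_numpy(dtype=np.float64)
print(encoded)
```

Passing the DataFrame through untouched, so that lightgbm can do this encoding itself, avoids having to replicate any of this logic on the shap side.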

I will push a fix later this week.