TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

Complex Pipeline process & show_prediction #213

Open armgilles opened 7 years ago

armgilles commented 7 years ago

Hi

I'm struggling to use show_prediction with a more complex pipeline and heterogeneous data... I know it is a pretty hot topic in scikit-learn & eli5.

I would like to use it like your example with the Titanic dataset, but with more than one text column.

# X_train & X_test are DataFrames

count_vec_txt_1 = CountVectorizer(analyzer='word', max_features=75)
count_vec_txt_2 = CountVectorizer(analyzer='word', max_features=35)
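
# (Sketch only: cust_regression_vals and cust_txt_col are my own small custom
#  transformers; they look roughly like this, the exact implementations may differ.)
from sklearn.base import BaseEstimator, TransformerMixin

class cust_regression_vals(BaseEstimator, TransformerMixin):
    """Drop the raw text columns and keep the already engineered numeric features."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(['text_feature_1', 'text_feature_2'], axis=1).values

class cust_txt_col(BaseEstimator, TransformerMixin):
    """Select a single text column as a 1-d array of strings."""
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.key].astype(str).values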

clf = Pipeline([
        ('union', FeatureUnion(
                    transformer_list = [
                        ('cst',  cust_regression_vals()),     # Already did some feature engineering, so I just keep it
                        ('text_feature_1', Pipeline([
                            ('text_feature_1', cust_txt_col(key='text_feature_1')), # Selector
                            ('count_vec_txt_1', count_vec_txt_1)
                        ])),
                        ('text_feature_2', Pipeline([
                            ('text_feature_2', cust_txt_col(key='text_feature_2')), # Selector
                            ('count_vec_txt_2', count_vec_txt_2)
                        ])),

                    ]
        )),
        ('algo', xgb_model)
    ])

# Learning
clf.fit(X_train, y_train)

## My goal now is to get all my feature names (no get_feature_names() yet)
# Get feature names, including the text transformers:
features = X_train.columns.tolist()

# Remove the raw text columns (they go through the text processing instead)
for col in ['text_feature_1', 'text_feature_2']:
    features.remove(col)

count_vec_txt_1.fit(X_train.text_feature_1)
features.extend(count_vec_txt_1.get_feature_names())

count_vec_txt_2.fit(X_train.text_feature_2)
features.extend(count_vec_txt_2.get_feature_names())
# I have all my feature names now
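
# Alternative sketch (assumption: scikit-learn fits the FeatureUnion transformers
# in place, so they are already fitted after clf.fit()): the fitted vectorizers
# can also be reached through the pipeline instead of re-fitting them.
fitted_union = clf.named_steps['union']
fitted_vec_1 = fitted_union.transformer_list[1][1].named_steps['count_vec_txt_1']
fitted_vec_2 = fitted_union.transformer_list[2][1].named_steps['count_vec_txt_2']
# fitted_vec_1.get_feature_names() / fitted_vec_2.get_feature_names() give the
# same vocabularies as the manual fits above.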

# Want to debug some rows (curiosity etc...)

eli5.show_prediction(clf, X_test[X_test.index == 42], feature_names=features)
# ERROR
Error: estimator Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('cst', 
cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', 
cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', 
binary=False, ........))]) is not supported 

I've tried many things but I'm stuck here...

lopuhin commented 7 years ago

@armgilles wow, thanks for a great example - it would be great to try applying eli5 in this case. Could you post a complete notebook somewhere, if it's convenient for you?

armgilles commented 7 years ago

Sure, but I can't share the data... I can use the classic fetch_20newsgroups if that's OK for you?

lopuhin commented 7 years ago

@armgilles ah sorry, I thought it was based directly on the Titanic tutorial. I think what you provided is already enough.

kmike commented 7 years ago

Currently Pipeline support is not implemented for explain_prediction - it is implemented only for explain_weights; that's the reason https://github.com/TeamHG-Memex/eli5/issues/15 is still open.

Could you try passing clf.named_steps['algo'] as an estimator and clf.named_steps['union'] as vec? Does it work?

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
                     feature_names=features, vec=clf.named_steps['union'])

armgilles commented 7 years ago

Nope, it doesn't work:

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
                     feature_names=features, vec=clf.named_steps['union'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-34ed186ec620> in <module>()
      1 eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42], 
----> 2                      feature_names=features, vec=clf.named_steps['union'])

/root/anaconda2/lib/python2.7/site-packages/eli5/ipython.pyc in show_prediction(estimator, doc, **kwargs)
    261     """
    262     format_kwargs, explain_kwargs = _split_kwargs(kwargs)
--> 263     expl = explain_prediction(estimator, doc, **explain_kwargs)
    264     html = format_as_html(expl, **format_kwargs)
    265     return HTML(html)

/root/anaconda2/lib/python2.7/site-packages/singledispatch.pyc in wrapper(*args, **kw)
    208 
    209     def wrapper(*args, **kw):
--> 210         return dispatch(args[0].__class__)(*args, **kw)
    211 
    212     registry[object] = func

/root/anaconda2/lib/python2.7/site-packages/eli5/xgboost.pyc in explain_prediction_xgboost(xgb, doc, vec, top, top_targets, target_names, targets, feature_names, feature_re, feature_filter, vectorized)
    123     Weights of all features sum to the output score of the estimator.
    124     """
--> 125     xgb_feature_names = xgb.booster().feature_names
    126     vec, feature_names = handle_vec(
    127         xgb, doc, vec, vectorized, feature_names, num_features=len(xgb_feature_names))

TypeError: 'str' object is not callable

Some information:

clf.named_steps['algo']
# Return
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05.......)
clf.named_steps['union']
# Return 
FeatureUnion(n_jobs=1,
       transformer_list=[('cst', cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u...trip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None))]))],
       transformer_weights=None)

armgilles commented 7 years ago

Hey, I created a notebook with the Titanic dataset using this kind of pipeline.

I use a specific function to build some features and then apply my pipeline process.

Let me know if I can help with anything.

lopuhin commented 7 years ago

Thanks for the example, @armgilles!

Actually, this last error is related to how we handle pandas DataFrames. Currently we assume that the vectorizer is able to handle a list of inputs as its input, but that is not correct in this case. A way to make your example work with the current eli5 is to pass an already vectorized document:

eli5.show_prediction(clf.named_steps['algo'], 
                     clf.named_steps['union'].transform(X_test[X_test.index == 809]), 
                     feature_names=features)

This gives an explanation: [screenshot]

There is also a way to make your original example work (a9ec021), but I'm not sure it's consistent with our API: currently we always advise passing a single document, not a container of length 1. To be fair, passing X_test[X_test.index == 809].iloc[0] instead of X_test[X_test.index == 809] also fails currently. So it requires more thought about the API we advertise, and probably more pandas support, because it seems natural to have the vectorizer operate on pandas DataFrames.

armgilles commented 7 years ago

Thanks @lopuhin for the reply.

I updated my notebook and added some comments.

I wish I could help with a PR, but I'm not in my comfort zone here. Maybe I can help with some examples and documentation.

armgilles commented 7 years ago

I have a strange bug in this notebook when I fit my model (a simple XGBoost, no pipeline) and predict a row with eli5.show_prediction:

[screenshot]

y=1 is wrong here (probability 0.061); it should be y=0.

If I force targets in eli5.show_prediction to xgb_model_1.classes_ (array([0, 1])), it's the same result:

[screenshot]

To fix it I have to set targets=[1, 0]:

[screenshot]
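
Roughly, the calls look like this (just a sketch; doc stands for the single row being explained):

# sketch of the three calls above; 'doc' is the explained row from my notebook
eli5.show_prediction(xgb_model_1, doc)                                 # shows y=1 (probability 0.061)
eli5.show_prediction(xgb_model_1, doc, targets=xgb_model_1.classes_)   # same result
eli5.show_prediction(xgb_model_1, doc, targets=[1, 0])                 # shows y=0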

Did I miss something?

I could open a new issue for better understanding.

lopuhin commented 7 years ago

@armgilles currently y=1 is shown for binary classifiers in any case, but @kmike is working on this issue: #223

kmike commented 7 years ago

@armgilles if you have a binary classification task with class names (e.g. "red" and "blue") it is not that bad - y="red" (probability=0.061) kind of makes sense. So currently y=1 (probability=0.061) should be read as "y=1 with probability 0.061". But as @lopuhin said, it'll be fixed.

sathyz commented 5 years ago

I'm trying a simple pipeline with show_prediction() and it is failing.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import eli5

fnames = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
tnames = ["Setosa", "Versicolour", "Virginica"]

Xs, ys = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(Xs, ys, shuffle=True, test_size=0.2)

scaler = StandardScaler()
lr = LogisticRegression()
pipeline = make_pipeline(scaler, lr)
pipeline.fit(X_train, y_train)

random_sample = np.random.randint(len(X_test))
doc = X_test[random_sample]
eli5.show_prediction(pipeline, doc, feature_names=fnames, target_names=tnames)

The error is,

Error: estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, 
with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, 
penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]) is not
supported

I did the following to get it to work:

doc_raw = np.expand_dims(X_test[random_sample], axis=0)
doc = np.squeeze(scaler.transform(doc_raw))
eli5.explain_prediction(lr, doc, feature_names=fnames, target_names=tnames)
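
A slightly more general sketch of the same workaround, applying every step except the final estimator (Xt is just a temporary name used here):

# run the document through all transformer steps, then explain the final estimator
Xt = X_test[random_sample:random_sample + 1]
for name, step in pipeline.steps[:-1]:
    Xt = step.transform(Xt)
eli5.show_prediction(pipeline.steps[-1][1], Xt[0],
                     feature_names=fnames, target_names=tnames)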

Version: 0.8