armgilles opened this issue 7 years ago
@armgilles wow, thanks for a great example - it would be great to try applying eli5 in this case. Could you post a complete notebook somewhere, if it's convenient for you?
Sure, but I can't share the data... I can use the classic fetch_20newsgroups if that's OK with you?
@armgilles ah sorry, I thought it was based directly on the Titanic tutorial. I think what you provided is already enough.
Currently Pipeline support is not implemented for explain_prediction - it is implemented only for explain_weights; that's the reason https://github.com/TeamHG-Memex/eli5/issues/15 is still open.
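For contrast, a minimal sketch of the path that does work, assuming the fitted clf pipeline from your example:

# explain_weights / show_weights already accept the whole Pipeline
eli5.show_weights(clf)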
Could you try passing clf.named_steps['algo'] as the estimator and clf.named_steps['union'] as vec? Does it work?
eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
                     feature_names=features, vec=clf.named_steps['union'])
Nope, it doesn't work:
eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
                     feature_names=features, vec=clf.named_steps['union'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-60-34ed186ec620> in <module>()
1 eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
----> 2 feature_names=features, vec=clf.named_steps['union'])
/root/anaconda2/lib/python2.7/site-packages/eli5/ipython.pyc in show_prediction(estimator, doc, **kwargs)
261 """
262 format_kwargs, explain_kwargs = _split_kwargs(kwargs)
--> 263 expl = explain_prediction(estimator, doc, **explain_kwargs)
264 html = format_as_html(expl, **format_kwargs)
265 return HTML(html)
/root/anaconda2/lib/python2.7/site-packages/singledispatch.pyc in wrapper(*args, **kw)
208
209 def wrapper(*args, **kw):
--> 210 return dispatch(args[0].__class__)(*args, **kw)
211
212 registry[object] = func
/root/anaconda2/lib/python2.7/site-packages/eli5/xgboost.pyc in explain_prediction_xgboost(xgb, doc, vec, top, top_targets, target_names, targets, feature_names, feature_re, feature_filter, vectorized)
123 Weights of all features sum to the output score of the estimator.
124 """
--> 125 xgb_feature_names = xgb.booster().feature_names
126 vec, feature_names = handle_vec(
127 xgb, doc, vec, vectorized, feature_names, num_features=len(xgb_feature_names))
TypeError: 'str' object is not callable
Some information:
clf.named_steps['algo']
# Returns:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=1, gamma=0, learning_rate=0.05, ...)
clf.named_steps['union']
# Returns:
FeatureUnion(n_jobs=1,
transformer_list=[('cst', cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u...trip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None))]))],
transformer_weights=None)
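Side note: the TypeError above looks like an xgboost API change. In recent xgboost versions, booster is a constructor parameter (the string 'gbtree' visible in the repr above) and the underlying Booster object is exposed through get_booster() instead, so eli5's xgb.booster() call trips over it. A quick check, assuming the fitted classifier above:

algo = clf.named_steps['algo']
print(algo.booster)                       # now a plain string: 'gbtree'
print(algo.get_booster().feature_names)   # the accessor newer xgboost provides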
Hey, I created a notebook with the Titanic dataset using this kind of pipeline.
I use a specific function to build some features and then apply my pipeline.
Let me know if I can help with anything.
Thanks for the example, @armgilles!
Actually, this last error is related to how we handle pandas dataframes. Currently we assume that the vectorizer is able to handle a list of inputs, but this is not correct in this case. A way to make your example work with current eli5 is to pass an already vectorized document:
eli5.show_prediction(clf.named_steps['algo'],
                     clf.named_steps['union'].transform(X_test[X_test.index == 809]),
                     feature_names=features)
This gives an explanation:
There is also a way to make your original example work (a9ec021), but I'm not sure it's consistent with our API: currently we always advise passing a single document, not a container of length 1. To be fair, passing X_test[X_test.index == 809].iloc[0] instead of X_test[X_test.index == 809] also fails currently. So it requires more thought about the API we advertise, and probably more pandas support, because it seems natural to have the vectorizer operate on pandas dataframes.
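For concreteness, the single-document form mentioned above would look like this (a sketch; as noted, it currently fails too):

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 809].iloc[0],
                     vec=clf.named_steps['union'], feature_names=features)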
Thanks @lopuhin for the reply.
I updated my notebook and added some comments.
I wish I could help with a PR, but I'm not in my comfort zone here. Maybe I can help you with some examples and documentation.
I have a strange bug in this notebook when I fit my model (simple xgboost, no pipeline) and predict a line with eli5.show_prediction: y=1 is wrong here (0.061 proba), it should be y=0.
If I force targets in eli5.show_prediction with xgb_model_1.classes_ (array([0, 1])), I get the same result. To fix it I have to set targets=[1, 0] (see the sketch below).
Did I miss something? I could open a new issue for better understanding.
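A sketch of that workaround call (xgb_model_1, doc and features are names assumed from the notebook, not shown in this thread):

# forcing the class order works around the flipped binary label display
eli5.show_prediction(xgb_model_1, doc,
                     targets=[1, 0],       # reported fix: reverse the class order
                     feature_names=features)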
@armgilles currently y=1 is shown for binary classifiers in any case, but @kmike is working on this issue: #223
@armgilles if you have a binary classification task with class names (e.g. "red" and "blue") it is not that bad - y="red" (probability=0.061) kind of makes sense. So currently y=1 (probability=0.061) should be read as "y=1 with probability 0.061". But as @lopuhin said, it'll be fixed.
I'm trying show_prediction() with a simple pipeline and it is failing.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import eli5

fnames = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
tnames = ["Setosa", "Versicolour", "Virginica"]

Xs, ys = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(Xs, ys, shuffle=True, test_size=0.2)

scaler = StandardScaler()
lr = LogisticRegression()
pipeline = make_pipeline(scaler, lr)
pipeline.fit(X_train, y_train)

# explain the prediction for one random test row
random_sample = np.random.randint(len(X_test))
doc = X_test[random_sample]
eli5.show_prediction(pipeline, doc, feature_names=fnames, target_names=tnames)
The error is:
Error: estimator Pipeline(memory=None,
    steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
           ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False,
               fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr',
               n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
               verbose=0, warm_start=False))]) is not supported
I did the following to get it to work:
# apply the scaler manually, then explain only the final estimator
doc_raw = np.expand_dims(X_test[random_sample], axis=0)  # transform expects a 2D array
doc = np.squeeze(scaler.transform(doc_raw))              # back to a single document
eli5.explain_prediction(lr, doc, feature_names=fnames, target_names=tnames)
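The same workaround can be written generically (a sketch only; explain_via_pipeline is a hypothetical helper, not part of eli5, and it assumes a fitted sklearn Pipeline whose last step is the estimator):

import numpy as np
import eli5

def explain_via_pipeline(pipeline, doc, **kwargs):
    # run the single document through every transform step,
    # then explain only the final estimator, as done manually above
    X = np.expand_dims(doc, axis=0)        # transformers expect a 2D array
    for name, step in pipeline.steps[:-1]:
        X = step.transform(X)
    return eli5.explain_prediction(pipeline.steps[-1][1], np.squeeze(X), **kwargs)

explain_via_pipeline(pipeline, X_test[random_sample],
                     feature_names=fnames, target_names=tnames)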
Version: 0.8
Hi,
I'm struggling to use show_prediction with a more complex pipeline and heterogeneous data... I know it is a pretty hot topic in scikit-learn & eli5. I would like to use it like your Titanic dataset example, but with more than one text column.
I've tried many things but I'm stuck here...