TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

scikit-learn pipeline (SVC) and .explain_linear_classifier_weights #277

Open AbeerAldayel opened 5 years ago

AbeerAldayel commented 5 years ago

I have the following scikit-learn pipeline that uses SVC for multi-class classification. When I call

.explain_linear_classifier_weights

I get an error about the number of feature names. Is there a way to interpret a scikit-learn pipeline with multiple feature sets?


# Pipeline / FeatureUnion imports; ItemSelector is a custom transformer
# that selects a single column from the input DataFrame.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

Book_contents = Pipeline([('selector', ItemSelector(key='Book')),
                          ('tfidf', CountVectorizer(analyzer='word',
                                                    binary=True,
                                                    ngram_range=(1, 1))),
                         ])

Author_description = Pipeline([('selector', ItemSelector(key='Description')),
                               ('tfidf', CountVectorizer(analyzer='word',
                                                         binary=True,
                                                         ngram_range=(1, 1))),
                              ])

ppl = Pipeline([('feats', FeatureUnion([('Contents', Book_contents),
                                        ('Desc', Author_description)])),
                ('clf', SVC(kernel='linear', class_weight='balanced'))
               ])

model = ppl.fit(training_data, Y_train)

Below is how I access the feature names:

f1 = model.named_steps['feats'].transformer_list[0][1].named_steps['tfidf'].get_feature_names()
f2 = model.named_steps['feats'].transformer_list[1][1].named_steps['tfidf'].get_feature_names()
list_features = f1
list_features.append(f2)
explain_weights.explain_linear_classifier_weights(model.named_steps['clf'],
                                                  vec=None, top=20,
                                                  target_names=ppl.classes_,
                                                  feature_names=list_features)

The exact error I got is:

feature_names has a wrong length: expected=47783, got=10528
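As a side note, one likely cause of the length mismatch is that list_features.append(f2) adds f2 as a single nested element rather than extending the list, so the result has len(f1) + 1 entries instead of one name per column of the FeatureUnion output. A minimal sketch of building a flat name list, assuming the same fitted pipeline as above:

f1 = model.named_steps['feats'].transformer_list[0][1].named_steps['tfidf'].get_feature_names()
f2 = model.named_steps['feats'].transformer_list[1][1].named_steps['tfidf'].get_feature_names()
# Concatenate (not append) so the list length matches the number of
# columns produced by the FeatureUnion.
list_features = f1 + f2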

Also, when I pass the union features as a combined list and use display(eli5.show_weights(model.named_steps['clf'], feature_names=features, target_names=ppl.classes_, top=15)) I get this error:

Error: only binary libsvm-based classifiers are supported

kmike commented 5 years ago

This problem is specific to the SVC classifier you're using: it is based on libsvm, which uses the OvO (one-vs-one) scheme instead of OvR (one-vs-rest) in the multi-class case. That is trickier to display, so it was not implemented in https://github.com/TeamHG-Memex/eli5/pull/221. You can probably replace SVC(kernel='linear') with sklearn.linear_model.LinearSVC, which should be supported and should also work much faster.
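A minimal sketch of that change, assuming the same Book_contents, Author_description, training_data and Y_train as in the original snippet, and the flattened feature-name list from above:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC
import eli5

# LinearSVC is liblinear-based and uses one-vs-rest for multi-class,
# so eli5 can display per-class weights for it.
ppl = Pipeline([('feats', FeatureUnion([('Contents', Book_contents),
                                        ('Desc', Author_description)])),
                ('clf', LinearSVC(class_weight='balanced'))
               ])
model = ppl.fit(training_data, Y_train)

f1 = model.named_steps['feats'].transformer_list[0][1].named_steps['tfidf'].get_feature_names()
f2 = model.named_steps['feats'].transformer_list[1][1].named_steps['tfidf'].get_feature_names()

eli5.show_weights(model.named_steps['clf'],
                  feature_names=f1 + f2,
                  target_names=model.classes_,
                  top=15)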

amanpreets01 commented 5 years ago

@AbeerAldayel Can you share the head of your data? This can happen when features are taken one at a time: after each one goes through its vectorizer it becomes difficult to keep it aligned with the other features. If the entire feature DataFrame is passed, each feature is vectorized consistently and can be fit with SVC.