TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

Scikit-learn Pipeline support #15

Open lopuhin opened 7 years ago

lopuhin commented 7 years ago

This is especially important when the pipeline contains feature selection: right now, if we take a vectorizer and a classifier with feature selection in between and pass them to explain_weights, the weights shown are incorrect.

kmike commented 7 years ago

See https://github.com/scikit-learn/scikit-learn/issues/6425, https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/pull/2007 and all related tickets. I think this is better implemented in scikit-learn.

lopuhin commented 7 years ago

Yeah, pipelines are a complicated beast :)

I've got one use case that seems impossible to explain with the current eli5, though maybe it's too specific. Say I have the following pipeline: CountVectorizer, TfidfTransformer, some classifier. I don't want to replace CountVectorizer and TfidfTransformer with TfidfVectorizer: CountVectorizer takes a while to run, so I cache its results, but I can't do the same with TfidfVectorizer because it's not entirely fair to train it on test data (while that's fine for CountVectorizer).

I can explain the weights of this pipeline by passing the classifier and feature names directly. But I can't explain predictions with text feature highlighting: without highlighting I could pass the document through the first two stages of the pipeline, but highlighting needs a text document.

A possible solution is to add limited pipeline support to explain_prediction: the whole pipeline is used to get predictions, but the classifier is extracted to get the weights. Alternatively, I'd like to be able to pass both the text and the vectorized document: this doesn't involve pipelines at all and is easier to explain.

lopuhin commented 7 years ago

Hm, OK, it's definitely not impossible to support this with the current eli5: I can just define a new vectorizer that wraps the existing CountVectorizer and TfidfTransformer (or just hack TfidfVectorizer). In general, I can always create a vectorizer that does what I want and pass it to eli5.

lopuhin commented 7 years ago

> I can always create a vectorizer that does what I want and pass it to eli5.

But it's not trivial: you have to define all the methods eli5 might expect from the text vectorizer for text highlighting to work, and also subclass VectorizerMixin (this requirement could be relaxed), so one way is:

```python
from sklearn.feature_extraction.text import VectorizerMixin

class BeastVectorizer(VectorizerMixin):
    """Present a CountVectorizer + transformer pipeline to eli5 as a
    single vectorizer-like object."""
    def __init__(self, vec, pipeline):
        self.vec = vec            # supplies the analyzer, vocabulary, etc.
        self.pipeline = pipeline  # supplies transform / fit_transform

    def __getattr__(self, attr):
        # Delegate transform-like methods ('transform', 'fit_transform')
        # to the full pipeline, and everything else to the vectorizer.
        obj = self.pipeline if 'transform' in attr else self.vec
        return getattr(obj, attr)
```

(it can still be subtly wrong due to subclassing from VectorizerMixin)

kmike commented 7 years ago

It'd be great to support pipelines directly: it should be possible to pass a pipeline in the estimator argument, without passing vec; custom vectorizers look like a hack.

So, from the pipeline we need to get:

  1. Coefficients of the estimator. I think we can take the last step of the pipeline and assume it is an estimator. If it is not a supported estimator, raise an error.
  2. Target names of the estimator (the classes_ attribute). This comes from the same object as (1).
  3. Feature names. This is tricky; it looks like the best place to solve it is scikit-learn itself. As I understand it, the idea is to add a get_feature_names() method to Pipeline; this method would call the first step's get_feature_names(), pass the result to the next step's get_feature_names(), and so on, and then return the result for the last (or second-to-last?) step. Various transformers would have to be fixed to support this new get_feature_names signature before support could be added to Pipeline. I don't think we should implement this in eli5 (though contributing it to scikit-learn is an option). But supporting a few common hardcoded cases in eli5 sounds fine to me.
  4. Text processing method. We can extract one of the supported vectorizers from the pipeline, but then we need to apply the subsequent pipeline steps to the result (e.g. in your example, adjust the weights produced by CountVectorizer using TfidfTransformer), so this is also tricky. It must be possible to do, but I'm not sure how to structure the code.
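Points 1 and 2 are straightforward to sketch (the pipeline contents here are illustrative, not eli5 code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(["good movie", "bad movie"], [1, 0])

name, estimator = pipe.steps[-1]      # (1) assume the last step is the estimator
if not hasattr(estimator, 'coef_'):   # unsupported estimator -> raise an error
    raise TypeError("final pipeline step is not a supported estimator")
target_names = estimator.classes_     # (2) the same object provides classes_
```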

It'd also be nice to support pipelines where CountVectorizer is not the first step. For example, one can create a pipeline which takes dicts as input, extracts the text field from each dict using FunctionTransformer, and then applies CountVectorizer to that text. It looks possible to support this use case as well.
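That dict-input use case can be sketched like this (the field name `'text'` is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# FunctionTransformer pulls the text field out of each dict; validate=False
# lets it accept a plain list of dicts rather than an array.
get_text = FunctionTransformer(
    lambda records: [r['text'] for r in records], validate=False)

pipe = make_pipeline(get_text, CountVectorizer())
X = pipe.fit_transform([{'text': 'hello world'}, {'text': 'hello there'}])
```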

Regarding caching CountVectorizer results: with default arguments it should be fine, but there can still be information leaks if the min_df / max_df / max_features arguments are used.
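The min_df leak can be shown directly on toy documents: fitting on train plus test lets test documents push a term over the min_df threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["spam spam eggs", "ham eggs"]
test = ["ham toast"]

# Fit on train only: with min_df=2, only 'eggs' appears in >= 2 documents.
vocab_clean = set(CountVectorizer(min_df=2).fit(train).vocabulary_)

# Fit on train + test: the test document pushes 'ham' over the threshold,
# so information about the test data leaks into the vocabulary.
vocab_leaky = set(CountVectorizer(min_df=2).fit(train + test).vocabulary_)
```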

lopuhin commented 7 years ago

re "3. Feature names." - I think passing explicit feature_names is not too hard to do, so maybe we can skip this step for now. re "4. Text processing method." - I agree that this one is tricky. I think it's similar in spirit to what was required to make FeatureUnion work, and it might again require a lot of changes.

jnothman commented 7 years ago

I think eli5 is in a privileged position to hack a solution to feature names. And no, @lopuhin, I'm not sure in what timeframe it will be fixed in scikit-learn; I think eli5 could experiment with a more flexible interface.

If I were you, as a first hack, I'd take the likes of my monkey patch and rewrite it as a function rather than methods.

(A thought: In order to only extract feature names for a specific subset of features, each transformer needs to describe the dependencies from input to output.)

kmike commented 7 years ago

@jnothman a good point about interface experiments; eli5 can definitely allow more API experimentation than scikit-learn, since it's not such a big deal to break things at this stage in eli5. #158 looks like a great start, I like your design.