alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use sklearn.pipeline for make_step #10

Closed ispmarin closed 4 years ago

ispmarin commented 4 years ago

I'm trying to use baikal to create a stacked model for text features, so I created a pipeline with CountVectorizer and TfidfTransformer and passed the pipeline to make_step:

X_train, X_test, y_train, y_test = train_test_split(dfi[feat_var], dfi.binary_label)

text_clf_nb = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('mdl', MultinomialNB()),
])

text_clf_rf = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('mdl', RandomForestClassifier()),
])

text_clf_svc = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('mdl', LinearSVC()),
])

MultinomialNB = make_step(text_clf_nb)
RandomForestClassifier = make_step(text_clf_rf)
LinearSVC = make_step(text_clf_svc)

x = Input()
y_t = Input()
y1 = MultinomialNB()(x, y_t)
y2 = RandomForestClassifier()(x, y_t)
ensemble_features = Concatenate()([y1, y2])
y = LinearSVC()(ensemble_features, y_t)

model = Model(x, y, y_t)

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print(classification_report(y_test, y_test_pred))

But I'm getting the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-46dbe50df798> in <module>
     20 
     21 
---> 22 MultinomialNB = make_step(text_clf_nb)
     23 RandomForestClassifier = make_step(text_clf_rf)
     24 LinearSVC = make_step(text_clf_svc)

~/lib/venv/risk/lib/python3.6/site-packages/baikal/steps/factory.py in make_step(base_class)
     42 
     43     metaclass = type(base_class)
---> 44     name = base_class.__name__
     45     bases = (Step, base_class)
     46     caller_module = inspect.currentframe().f_back.f_globals["__name__"]

AttributeError: 'Pipeline' object has no attribute '__name__'

Any ideas on how to solve this? Thanks

alegonz commented 4 years ago

Hi there!

The problem here is that make_step takes a class to produce another, but you are passing it an instance of Pipeline.
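
To illustrate the class-versus-instance distinction, here is a minimal pure-Python sketch. `Step`, `Pipeline`, and `make_step_sketch` below are hypothetical stand-ins written only for this example (they are not baikal's or scikit-learn's actual code); the factory just mirrors the `base_class.__name__` and `bases = (Step, base_class)` lines visible in the traceback, with an added type check as an assumption of mine.

```python
class Step:
    """Hypothetical stand-in for baikal's Step base class."""


class Pipeline:
    """Hypothetical stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps=None):
        self.steps = steps or []


def make_step_sketch(base_class):
    """Simplified sketch of a class factory (not baikal's implementation)."""
    if not isinstance(base_class, type):
        raise TypeError(
            f"expected a class, got an instance of {type(base_class).__name__}"
        )
    # Build a new class that inherits from both Step and the given class.
    return type(base_class.__name__, (Step, base_class), {})


# Passing the class works: classes carry a __name__ attribute.
PipelineStep = make_step_sketch(Pipeline)
step = PipelineStep(steps=[])
class_ok = issubclass(PipelineStep, Step) and isinstance(step, Pipeline)

# Instances do not expose __name__, which is what the traceback reports.
instance_has_name = hasattr(Pipeline(), "__name__")

# Passing an instance is rejected by the sketch's explicit type check.
try:
    make_step_sketch(Pipeline())
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

In the original snippet, `text_clf_nb` is an instance (the result of calling `Pipeline([...])`), so looking up `__name__` on it fails in the same way.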

You could make a step from the Pipeline class, but with baikal you don't need that class, as there is an idiomatic way of pipelining a linear sequence of steps. If I understood your snippet correctly, here's how I'd do it:

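import sklearn.ensemble
import sklearn.feature_extraction.text
import sklearn.naive_bayes
import sklearn.svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from baikal import Input, Model, make_step
from baikal.steps import Concatenate

# make_step takes a class (not an instance) and returns a new Step class.
CountVectorizer = make_step(sklearn.feature_extraction.text.CountVectorizer)
TfidfTransformer = make_step(sklearn.feature_extraction.text.TfidfTransformer)
MultinomialNB = make_step(sklearn.naive_bayes.MultinomialNB)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier)
LinearSVC = make_step(sklearn.svm.LinearSVC)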

classifiers = (MultinomialNB, RandomForestClassifier, LinearSVC)

x = Input()
y_t = Input()

classifier_outs = []
for classifier in classifiers:
    # Instead of using Pipeline class, do:
    z = CountVectorizer()(x)
    z = TfidfTransformer()(z)
    z = classifier()(z, y_t)
    classifier_outs.append(z)

ensemble_features = Concatenate()(classifier_outs)
y = LinearSVC()(ensemble_features, y_t)

model = Model(x, y, y_t)

X_train, X_test, y_train, y_test = train_test_split(dfi[feat_var], dfi.binary_label)

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print(classification_report(y_test, y_test_pred))

ispmarin commented 4 years ago

Thanks! Makes total sense now.