feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

Pipelines using pandas transform output do not work with feature creation #670

Closed by admivsn 1 year ago

admivsn commented 1 year ago

Describe the bug: Pipelines break when you use the pandas transform output together with feature creation.

To reproduce:

import pandas as pd

from feature_engine.creation import RelativeFeatures
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn import set_config

set_config(transform_output="pandas")

X = pd.DataFrame({"feature_1": [1, 2, 3, 4, 5], "feature_2": [6, 7, 8, 9, 10]})
y = pd.Series([0, 1, 0, 1, 0])

pipeline = make_pipeline(
    RelativeFeatures(
        variables=["feature_1"],
        reference=["feature_2"],
        func=["div"]
    ),
    DummyClassifier()
)

pipeline.fit(X, y)
pipeline.predict(X)

ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

solegalli commented 1 year ago

How annoying!

Thank you for reporting @admivsn

I need to look into the inner workings of the pipeline.

By the looks of the error, the pipeline does not expect the extra feature created by RelativeFeatures.

I'll see if I can dig up some time over the weekend.

Cheers

ClaudioSalvatoreArcidiacono commented 1 year ago

The root cause of this issue is this line:

class RelativeFeatures(BaseCreation):
    ...
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = super().transform(X)

Here the transform method of BaseCreation is called. That call in turn invokes the get_feature_names_out method defined in GetFeatureNamesOutMixin. Because the instance is a RelativeFeatures, it returns the output feature names of the child class RelativeFeatures rather than those of the parent class BaseCreation. BaseCreation.transform(self, X) returns a DataFrame with 2 columns, while get_feature_names_out returns 3 names, and this mismatch causes the error described in the issue.
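
The mechanism described above can be sketched in plain Python, without scikit-learn or feature_engine. This is a toy model under stated assumptions, not the library code: `wrap_output`, `Base`, `Relative` and `FixedRelative` are hypothetical names, rows are plain lists, and the `_validate` helper is only one possible way to avoid the problem (it is not necessarily how the maintainers fixed it).

```python
import functools


def wrap_output(method):
    """Toy output wrapper: label the columns returned by transform."""

    @functools.wraps(method)
    def wrapped(self, rows):
        out = method(self, rows)
        # Dispatches on the runtime type of self, so a parent method called
        # through super() still gets the child's feature names.
        names = self.get_feature_names_out()
        width = len(out[0])
        if len(names) != width:
            raise ValueError(
                f"Length mismatch: Expected axis has {width} elements, "
                f"new values have {len(names)} elements"
            )
        return [dict(zip(names, row)) for row in out]

    return wrapped


class Base:
    def _validate(self, rows):
        # Unwrapped helper: copies the rows, columns unchanged.
        return [list(r) for r in rows]

    @wrap_output
    def transform(self, rows):
        return self._validate(rows)

    def get_feature_names_out(self):
        return ["feature_1", "feature_2"]


class Relative(Base):
    """Reproduces the bug: calls the wrapped parent transform."""

    @wrap_output
    def transform(self, rows):
        rows = super().transform(rows)  # wrapper sees 2 columns, 3 names
        return [r + [r[0] / r[1]] for r in rows]

    def get_feature_names_out(self):
        return ["feature_1", "feature_2", "feature_1_div_feature_2"]


class FixedRelative(Relative):
    """One possible workaround: call the unwrapped helper instead."""

    @wrap_output
    def transform(self, rows):
        rows = self._validate(rows)
        return [r + [r[0] / r[1]] for r in rows]


try:
    Relative().transform([[1, 2]])
except ValueError as err:
    print(err)

print(FixedRelative().transform([[1, 2]]))
```

In this sketch the first call fails inside the parent's wrapped transform, before the new column is even added, which matches the 2-versus-3 mismatch in the traceback. The second call only labels the output once, after all three columns exist.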

solegalli commented 1 year ago

Yes, I just figured that out. I am fixing it as we speak. PR coming in the next few minutes. Thank you!

That was a nasty one!