BCG-X-Official / sklearndf

DataFrame support for scikit-learn.
https://bcg-x-official.github.io/sklearndf/
Apache License 2.0

Cannot use ColumnTransformerDF inside of StackingRegressorDF #176

Closed. JonasRauch closed this issue 3 years ago

JonasRauch commented 3 years ago

Summary:

Using StackingRegressorDF on pipelines containing a ColumnTransformerDF raises an error on .fit.

Using a StackingRegressorDF as the last step of a PipelineDF works as expected. However, stacking several PipelineDF objects that each contain a ColumnTransformerDF fails with the following error:

TypeError: StackingRegressorDF.fit: ColumnTransformerDF.fit_transform: arg y must be None, or a pandas Series or DataFrame

Root cause:

Most likely the cause is this line in StackingRegressor.fit, which converts y to a NumPy array before it is handed on to the stacked estimators:

y = column_or_1d(y, warn=True)
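
To illustrate the suspected mechanism, here is a minimal sketch that only needs numpy, pandas and scikit-learn. It shows that column_or_1d returns a plain NumPy array for a pandas Series, so the index and name are lost. The check_y function is a hypothetical stand-in for the kind of validation a DataFrame-aware transformer might perform; it is not sklearndf's actual code, and only mirrors the wording of the error message above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import column_or_1d

y = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"], name="y")

# scikit-learn flattens the target into a plain NumPy array,
# dropping the pandas index and name along the way
y_converted = column_or_1d(y, warn=True)

print(type(y).__name__)
print(type(y_converted).__name__)


def check_y(y):
    """Hypothetical stand-in for the y validation of a DataFrame-aware wrapper."""
    if y is not None and not isinstance(y, (pd.Series, pd.DataFrame)):
        raise TypeError("arg y must be None, or a pandas Series or DataFrame")
    return y


# The original Series passes the check ...
check_y(y)

# ... but the converted array is rejected, as in the reported error
try:
    check_y(y_converted)
    rejected = False
except TypeError as error:
    rejected = True
    print(error)
```

If this is indeed what happens, any stacked estimator that insists on a pandas y would receive the converted array and fail, while a stack placed at the end of a pipeline would not, because the transformers there see y before the conversion.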

Reproducible example:

from sklearndf.pipeline import PipelineDF
from sklearndf.regression import LinearRegressionDF, ElasticNetDF
from sklearndf.transformation import ColumnTransformerDF, StandardScalerDF
from sklearndf.regression import StackingRegressorDF

import pandas as pd
import numpy as np

# toy data set
np.random.seed(1)
data = pd.DataFrame({
    'x1': np.random.uniform(size=(10,)),
    'x2': np.random.uniform(size=(10,)),
    'y': np.random.uniform(size=(10,)),
})

# basic building blocks
model1 = LinearRegressionDF()
model2 = ElasticNetDF()
preprocessing = ColumnTransformerDF([
    ('x1', StandardScalerDF(), ['x1']),
    ('x2', 'passthrough', ['x2']),
])

# Pipeline with stack works
pipeline = PipelineDF([
    ('preprocessing', preprocessing),
    ('stack', StackingRegressorDF([
        ('model1', model1),
        ('model2', model2),
    ]))
])
pipeline.fit(data, data['y'])
print(pipeline.predict(data))

# Stack of pipelines fails on .fit with the TypeError quoted above
stack_of_pipelines = StackingRegressorDF([
    ('pipeline1', PipelineDF([
        ('preprocessing', preprocessing),
        ('model1', model1)
    ])),
    ('pipeline2', PipelineDF([
        ('preprocessing', preprocessing),
        ('model2', model2)
    ]))
])
stack_of_pipelines.fit(data, data['y'])

j-ittner commented 3 years ago

@JonasRauch thanks for spotting this. We just released new versions of sklearndf to address this - let us know how this works for you.