dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Bug in ColumnTransformer #962

Open aparnakesarkar opened 1 year ago

aparnakesarkar commented 1 year ago

I have a straightforward use case: label-encode some columns, one-hot encode some columns, and pass through some columns of a pandas DataFrame (dropping the remainder).

Code:

from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv('path/to/csv')

ordinal_cols = [<list of ordinal columns>]
nominal_cols = [<list of nominal columns>]
passthrough_cols =  [<list of passthrough columns>]

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ('select', 'passthrough', passthrough_cols)
]

preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)

This failed with the following traceback:

Traceback (most recent call last):
  File ".../helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File ".../dask_testing.py", line 80, in <module>
    df_t = preprocessor.fit_transform(df)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
    return self._hstack(list(Xs))
  File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
    return pd.concat(Xs, axis="columns")
  File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
    op = _Concatenator(
  File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
    raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

On further debugging, I found that the three steps in the transformer produce three different types of output:

  1. OrdinalEncoder() gives a 2D ndarray
  2. OneHotEncoder() gives a csr_matrix
  3. "passthrough" gives a dataframe
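
The three output types listed above can be checked on a small generated DataFrame. This is only a sketch: the column names and values are made up for illustration and are not the reporter's data, and it assumes the encoders are used with their default settings.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical generated data; the column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "size": rng.choice(["small", "medium", "large"], size=10),
        "color": rng.choice(["red", "green", "blue"], size=10),
        "value": rng.normal(size=10),
    }
)

# Each step is run on its own, outside of any ColumnTransformer.
ordinal_out = OrdinalEncoder().fit_transform(df[["size"]])
onehot_out = OneHotEncoder().fit_transform(df[["color"]])
passthrough_out = df[["value"]]  # "passthrough" leaves the selected columns untouched

for name, out in [
    ("ordinal", ordinal_out),
    ("onehot", onehot_out),
    ("passthrough", passthrough_out),
]:
    print(name, type(out), "sparse" if sparse.issparse(out) else "dense")
```

With default settings the ordinal step should report a dense NumPy array, the one-hot step a SciPy sparse matrix, and the passthrough step a pandas DataFrame.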

The failure occurs in the dask-ml package at .../python3.8/site-packages/dask_ml/compose/_column_transformer.py, line 198, where it tries to concatenate the three different types into a single output DataFrame.

Code snippet:

elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
            return pd.concat(Xs, axis="columns")
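
The same TypeError can be triggered without dask-ml by handing pd.concat a list that mixes an ndarray with a DataFrame, which is what the branch above ends up doing. This is a minimal sketch with made-up values, not the reporter's data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for two of the transformer outputs.
frame = pd.DataFrame({"value": [0.1, 0.2, 0.3]})
array = np.array([[0.0], [1.0], [2.0]])

error_message = None
try:
    # pd.concat only accepts Series and DataFrame objects.
    pd.concat([array, frame], axis="columns")
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Because one of the outputs is a DataFrame, the `pd.DataFrame in types` check passes even though the other outputs are not pandas objects.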

Anything else we need to know?: My data has shape (1000, 1076). I am label-encoding 109 columns, one-hot encoding 1 column, and passing through the rest of the columns.

I do not want to use the remainder="passthrough" parameter; I want to specify the passthrough columns in the transformers list.

Environment:

aparnakesarkar commented 1 year ago

Solution?: sklearn handles this case by converting any sparse matrix to an ndarray before stacking.

Sklearn code snippet:

Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
return np.hstack(Xs)
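
Applied to mixed outputs like the ones in this issue, the densify-then-stack approach could look roughly like the sketch below. The helper name hstack_dense and the sample blocks are hypothetical, and this is not a patch to dask-ml.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def hstack_dense(blocks):
    """Densify sparse blocks, then stack all blocks column-wise (hypothetical helper)."""
    dense = [b.toarray() if sparse.issparse(b) else np.asarray(b) for b in blocks]
    return np.hstack(dense)


# Illustrative stand-ins for the three transformer outputs.
ordinal_block = np.array([[0.0], [1.0], [2.0]])              # 2D ndarray
onehot_block = sparse.csr_matrix(np.eye(3))                  # sparse matrix
passthrough_block = pd.DataFrame({"value": [0.1, 0.2, 0.3]}) # DataFrame

combined = hstack_dense([ordinal_block, onehot_block, passthrough_block])
print(combined.shape)
```

The result is a plain ndarray, so the DataFrame column names are lost. Converting a wide one-hot block to dense can also use a lot of memory.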

mmccarty commented 1 year ago

Hi @aparnakesarkar - Thank you for opening an issue. Would you please update your example to include generated data? See this blog for an example of generating data that reproduces the problem.