I have a straightforward use case: label-encode some columns, one-hot encode some columns, and pass some columns through unchanged in a pandas DataFrame (dropping the remainder).
Code:
from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
df = pd.read_csv('path/to/csv')
ordinal_cols = [<list of ordinal columns>]
nominal_cols = [<list of nominal columns>]
passthrough_cols = [<list of passthrough columns>]
transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ("select", "passthrough", passthrough_cols),
]
preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)
This failed with the following traceback:
Traceback (most recent call last):
File ".../helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File ".../dask_testing.py", line 80, in <module>
df_t = preprocessor.fit_transform(df)
File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
return self._hstack(list(Xs))
File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
return pd.concat(Xs, axis="columns")
File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
op = _Concatenator(
File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
On further debugging, the three steps in the transformer produce three different output types:
OrdinalEncoder() gives a 2D ndarray
OneHotEncoder() gives a csr_matrix
"passthrough" gives a DataFrame
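The three output types listed above can be checked on a tiny frame. This is a minimal sketch that calls the scikit-learn encoders directly rather than going through dask-ml; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, not the data from the issue.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 4.5],
})

# Each step on its own, with default encoder settings.
ordinal_out = OrdinalEncoder().fit_transform(df[["size"]])
onehot_out = OneHotEncoder().fit_transform(df[["color"]])
passthrough_out = df[["price"]]  # "passthrough" hands the selected columns on unchanged

for name, out in [("ordinal", ordinal_out), ("onehot", onehot_out), ("passthrough", passthrough_out)]:
    print(name, type(out))
```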
The failure occurs in the dask-ml package at .../python3.8/site-packages/dask_ml/compose/_column_transformer.py, line 198, where it tries to concatenate the three different types into a single output DataFrame.
Code snippet:
elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
    return pd.concat(Xs, axis="columns")
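The same TypeError can be triggered without dask-ml by handing pd.concat a list that mixes an ndarray with a DataFrame, which is what the branch above appears to receive when one step returns an array and another returns a DataFrame. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Stand-ins for the per-transformer outputs: one ndarray, one DataFrame.
Xs = [
    np.zeros((3, 1)),
    pd.DataFrame({"price": [1.0, 2.0, 3.0]}),
]

message = None
try:
    pd.concat(Xs, axis="columns")
except TypeError as exc:
    # pandas only accepts Series and DataFrame objects here.
    message = str(exc)

print(message)
```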
Anything else we need to know?:
The shape of my data is (1000, 1076):
label encoding 109 columns
one-hot encoding 1 column
passing through the rest of the columns
I do not want to use the remainder="passthrough" parameter; I want to list the passthrough columns in the transformers list.
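One possible workaround, assuming the input is an in-memory pandas DataFrame and no Dask collections are involved, is to use scikit-learn's own ColumnTransformer, which stacks mixed array, sparse, and DataFrame outputs itself. This is only a sketch on made-up toy columns. It keeps the passthrough step inside the transformers list, and the result is an array (or sparse matrix), not a DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer  # scikit-learn's class, not dask_ml's
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; "unused" is not listed anywhere and should be dropped.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
    "price": [1.0, 2.5, 3.0, 4.5],
    "unused": [0, 1, 2, 3],
})

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ["size"]),
    ("onehot_encoding", OneHotEncoder(), ["color"]),
    ("select", "passthrough", ["price"]),
]

# remainder defaults to "drop", so unlisted columns are discarded.
preprocessor = ColumnTransformer(transformers=transformers)
out = preprocessor.fit_transform(df)

# 1 ordinal column + 3 one-hot columns + 1 passthrough column
print(out.shape)
```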
Hi @aparnakesarkar - Thank you for opening an issue. Would you please update your example to include generated data? See this blog for an example on generating data that reproduces the problem.
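A sketch of what such generated data might look like, matching the shape described in the issue (1000 rows and 1076 columns: 109 ordinal, 1 nominal, and the rest passed through). The column names, category values, and random seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 1076

# Hypothetical column names for each group.
ordinal_cols = [f"ord_{i}" for i in range(109)]
nominal_cols = ["nom_0"]
n_passthrough = n_cols - len(ordinal_cols) - len(nominal_cols)
passthrough_cols = [f"num_{i}" for i in range(n_passthrough)]

data = {}
for col in ordinal_cols:
    data[col] = rng.choice(["low", "medium", "high"], size=n_rows)
for col in nominal_cols:
    data[col] = rng.choice(["a", "b", "c", "d"], size=n_rows)
for col in passthrough_cols:
    data[col] = rng.normal(size=n_rows)

df = pd.DataFrame(data)
print(df.shape)
```

This df, together with the three column lists, can stand in for the pd.read_csv call and the placeholder lists in the original snippet.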
Environment: