Open nprihodko opened 1 month ago
Describe the issue:
dask_ml.compose.ColumnTransformer does not work with objects of types dask_expr._collection.DataFrame or dask.dataframe.core.DataFrame.
Minimal Complete Verifiable Example:
import numpy as np
import pandas as pd
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

# Create a sample dataframe
df = pd.DataFrame({"A": np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=2)
ColumnTransformer, specifying the columns using strings:
scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), ["A"])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)
# or scaler.fit_transform(ddf.to_legacy_dataframe())
Out:
ValueError: Specifying the columns using strings is only supported for dataframes.
ColumnTransformer, specifying the columns using integers:
scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), [0])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)
# or scaler.fit_transform(ddf.to_legacy_dataframe())
Out:
AttributeError: 'DataFrame' object has no attribute 'take'
Anything else we need to know?:
Pandas data frames, i.e.
scaler.fit_transform(ddf.compute())
works as expected.
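For comparison, here is a minimal sketch of the pandas path that does work. It makes several assumptions that are not part of the report above: it uses scikit-learn's own ColumnTransformer and StandardScaler rather than the dask_ml versions, so that it runs without a Dask cluster; it adds a hypothetical second column "B" to make the passthrough visible; and it seeds the random generator for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Seeded generator so the sample data is reproducible (assumption, not in the report)
rng = np.random.default_rng(0)

# "B" is a hypothetical extra column, added only to show remainder="passthrough"
pdf = pd.DataFrame({"A": rng.random(1000), "B": rng.random(1000)})

scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), ["A"])],
    remainder="passthrough",
)

# With a pandas DataFrame, selecting columns by string name is accepted
result = scaler.fit_transform(pdf)

# The transformed column comes first, followed by the passthrough columns
scaled_a = result[:, 0]
passthrough_b = result[:, 1]
```

If the Dask data frame path behaved the same way, column "A" would come back standardized and any remaining columns would be passed through unchanged.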
This could be related to https://github.com/dask/dask-ml/issues/962 and https://github.com/dask/dask-ml/issues/887. If this is indeed the same issue, and there are no plans to fix it in the foreseeable future, would it be better to remove ColumnTransformer from the Dask ML API?
Environment:
Just based on the error message, this does look like https://github.com/dask/dask-ml/issues/887. What do you think? Are you interested in working on this?