dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
885 stars, 254 forks

ColumnTransformer does not work with Dask dataframes #993

Open nprihodko opened 1 month ago

nprihodko commented 1 month ago

Describe the issue:

dask_ml.compose.ColumnTransformer does not work with objects of types dask_expr._collection.DataFrame or dask.dataframe.core.DataFrame.

Minimal Complete Verifiable Example:

import numpy as np
import pandas as pd
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler

import dask.dataframe as dd
from dask.distributed import Client

client = Client()

# Create a sample dataframe
df = pd.DataFrame({"A": np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=2)

ColumnTransformer, specifying the columns using strings:

scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), ["A"])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())

Out:

ValueError: Specifying the columns using strings is only supported for dataframes.

ColumnTransformer, specifying the columns using integers:

scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), [0])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())

Out:

AttributeError: 'DataFrame' object has no attribute 'take'

Anything else we need to know?:

Passing a pandas DataFrame instead, i.e.

scaler.fit_transform(ddf.compute())

works as expected.

This could be related to https://github.com/dask/dask-ml/issues/962 and https://github.com/dask/dask-ml/issues/887. If it is indeed the same issue and there are no plans to fix it in the foreseeable future, would it be better to remove ColumnTransformer from the Dask ML API?

Environment:

TomAugspurger commented 1 month ago

Just based on the error message, this does look like https://github.com/dask/dask-ml/issues/887. What do you think? Are you interested in working on this?