dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

ColumnTransformer horizontally stacks the output #1000

Open reinierstorm opened 2 weeks ago

reinierstorm commented 2 weeks ago

The dask-ml ColumnTransformer stacks the outputs of the different transformers row-wise instead of placing them side by side as columns. The following code (essentially #365) gives an undesirable output:

import pandas as pd
import dask.dataframe as dd

import dask_ml.compose
import dask_ml.preprocessing

df = pd.DataFrame({"A": pd.Categorical(["a", "a", "b", "a"]), "B": [1.0, 2, 4, 5]})
ddf = dd.from_pandas(df, npartitions=2).reset_index(drop=True)

ct = dask_ml.compose.ColumnTransformer([
    ("A", dask_ml.preprocessing.OneHotEncoder(dtype='uint8'), ['A']),  # Example categorical feature
    ("B", dask_ml.preprocessing.RobustScaler(), ['B']),  # Numeric feature
])
ct.fit_transform(ddf).compute()

The output I get is:

    A_a     A_b     B
0   1.0     0   NaN
1   1.0     0   NaN
0   0   1.0     NaN
1   1.0     0   NaN
0   NaN     NaN     -1.000000
1   NaN     NaN     -0.666667
0   NaN     NaN     0.000000
1   NaN     NaN     0.333333

The output should match that of #365:

   A_a  A_b         B
0  1.0  0.0 -1.000000
1  1.0  0.0 -0.666667
0  0.0  1.0  0.000000
1  1.0  0.0  0.333333
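
For readers comparing the two tables: the difference has the same shape as the difference between row-wise and column-wise concatenation in pandas. The sketch below is pandas-only and is not dask-ml's implementation. The frames `encoded` and `scaled` are hypothetical stand-ins for the two transformer outputs, with values copied from the tables above.

```python
import pandas as pd

# Hypothetical stand-ins for the two transformer outputs; the index 0, 1, 0, 1
# mimics the per-partition index shown in the tables above.
index = [0, 1, 0, 1]
encoded = pd.DataFrame(
    {"A_a": [1.0, 1.0, 0.0, 1.0], "A_b": [0.0, 0.0, 1.0, 0.0]}, index=index
)
scaled = pd.DataFrame({"B": [-1.0, -0.666667, 0.0, 0.333333]}, index=index)

# Row-wise concatenation: 8 rows, NaN wherever a frame lacks a column.
stacked_rows = pd.concat([encoded, scaled], axis=0)

# Column-wise placement by position: 4 rows, no NaN.
side_by_side = pd.concat(
    [encoded.reset_index(drop=True), scaled.reset_index(drop=True)], axis=1
)
side_by_side.index = index

print(stacked_rows.shape, int(stacked_rows.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

The first frame has the layout of the output reported here, the second the layout expected from #365.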

Environment:

dask-ml version: 2024.4.4
dask version: 2024.8.1
Python version: 3.10.14
Operating System: Ubuntu 23.04
Install method (conda, pip, source): pip

TomAugspurger commented 2 weeks ago

Thanks for the bug report. Is the reset_index(drop=True) component necessary to reproduce that?

Let us know if you’re able to look into this some more.

On Aug 30, 2024, at 4:07 AM, reinierstorm wrote:

> The dask ColumnTransformer stacks the different transformers. The following code (essentially #365 https://github.com/dask/dask-ml/issues/365) gives an undesirable output

reinierstorm commented 2 weeks ago

> Thanks for the bug report. Is the reset_index(drop=True) component necessary to reproduce that?

No, it is not.

> Let us know if you’re able to look into this some more.

Yes, I am able to look into this some more.
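
As a starting point, a small check like the following could flag the symptom on any computed result. It is only a sketch: `looks_row_stacked` is a hypothetical helper, not part of dask-ml, and it assumes the input data itself contains no missing values. The two frames are typed in by hand from the tables in the report.

```python
import pandas as pd


def looks_row_stacked(result: pd.DataFrame, n_input_rows: int) -> bool:
    """Hypothetical helper: True if `result` has more rows than the input
    and every row contains at least one NaN, as in the reported output."""
    has_extra_rows = len(result) > n_input_rows
    every_row_has_nan = bool(result.isna().any(axis=1).all())
    return has_extra_rows and every_row_has_nan


# Values copied from the two tables in the report (None becomes NaN).
reported = pd.DataFrame(
    {
        "A_a": [1.0, 1.0, 0.0, 1.0, None, None, None, None],
        "A_b": [0.0, 0.0, 1.0, 0.0, None, None, None, None],
        "B": [None, None, None, None, -1.0, -0.666667, 0.0, 0.333333],
    }
)
expected = pd.DataFrame(
    {
        "A_a": [1.0, 1.0, 0.0, 1.0],
        "A_b": [0.0, 0.0, 1.0, 0.0],
        "B": [-1.0, -0.666667, 0.0, 0.333333],
    }
)

print(looks_row_stacked(reported, n_input_rows=4))
print(looks_row_stacked(expected, n_input_rows=4))
```

In practice `result` would be the frame returned by `ct.fit_transform(ddf).compute()` in the reproduction above.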