dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

sklearn StandardScaler vs dask StandardScaler. #979

Open Arunes007 opened 12 months ago

Arunes007 commented 12 months ago

I am getting different results from sklearn StandardScaler and dask StandardScaler.

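import sklearn.preprocessing
import dask_ml.preprocessing

scaler_sk = sklearn.preprocessing.StandardScaler()
scaler_d = dask_ml.preprocessing.StandardScaler()

# df_pd: pandas DataFrame; df_dask: Dask DataFrame (assumed to hold the same data)
scaler_sk.fit(df_pd[["SUMMESSAGECOUNT"]])
scaler_d.fit(df_dask[["SUMMESSAGECOUNT"]])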

Dask scaler

scaler_d.mean_[0], scaler_d.var_[0]
output: (19.157653421114507, 47431.17794342375)

Sklearn Scaler

scaler_sk.mean_[0], scaler_sk.var_[0]
output: (19.157653421114507, 47431.17794342373)

I know the difference is negligible, but it is affecting my model training with Prophet. Could you please suggest a way to make the two results identical without calling compute()?
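For reference, the size of the gap between the two variances reported above can be quantified. The snippet below is only a sketch: it copies the two printed values by hand (the variable names are made up for illustration) and expresses the difference in absolute terms, in relative terms, and in units in the last place (ULPs) of a 64-bit float.

```python
import math
import numpy as np

# Values copied from the outputs above (names are illustrative only).
var_dask = 47431.17794342375     # scaler_d.var_[0]
var_sklearn = 47431.17794342373  # scaler_sk.var_[0]

abs_diff = abs(var_dask - var_sklearn)
rel_diff = abs_diff / var_sklearn
# np.spacing gives the gap between a float and the next representable float.
ulps = abs_diff / np.spacing(var_sklearn)

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3e}")
print(f"difference in ULPs:  {ulps:.0f}")

# A tolerance-based comparison treats the two values as equal.
same = math.isclose(var_dask, var_sklearn, rel_tol=1e-12)
print("equal within rel_tol=1e-12:", same)
```

The two values differ only in the last few representable bits, so a tolerance-based comparison is usually more appropriate than exact equality here.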

TomAugspurger commented 11 months ago

I think that floating point inaccuracies are just a fact of life when you’re doing things in chunks, at least with the algorithms that dask.array uses today. I don’t think there’s anything we can do in dask-ml to address that (but maybe check the source to be sure).
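To illustrate the point about chunking, here is a minimal NumPy-only sketch. It is not the algorithm dask.array or dask-ml actually uses; the helper `chunked_mean_var` and the synthetic data are assumptions made purely for illustration. It shows that floating-point addition is not associative, so combining per-chunk partial sums can change the last bits of a mean or variance compared with a single pass over the whole array.

```python
import numpy as np

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

def chunked_mean_var(x, n_chunks):
    """Population mean and variance built from per-chunk partial sums.

    Hypothetical helper for illustration; not dask's implementation.
    """
    chunks = np.array_split(x, n_chunks)
    n = sum(len(c) for c in chunks)
    mean = sum(c.sum() for c in chunks) / n
    sq_dev = sum(((c - mean) ** 2).sum() for c in chunks)
    return mean, sq_dev / n

# Synthetic data standing in for the real column.
rng = np.random.default_rng(0)
x = rng.exponential(scale=20.0, size=1_000_000)

mean_full, var_full = x.mean(), x.var()
mean_chunked, var_chunked = chunked_mean_var(x, n_chunks=8)

# The results agree to many digits but are not guaranteed to be bit-identical.
print("variance (one pass):", var_full)
print("variance (chunked): ", var_chunked)
print("difference:         ", var_full - var_chunked)
```

Any difference that shows up is typically in the last few digits and can vary with the number of chunks, which is why exact equality between a chunked and a non-chunked computation generally cannot be relied on.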

On Dec 1, 2023, at 5:35 AM, Arunesh Singh @.***> wrote:

I am getting different results from sklearn StandardScaler and dask StandardScaler.

scaler_sk = sklearn.preprocessing.StandardScaler()
scaler_d = dask_ml.preprocessing.StandardScaler()

scaler_sk.fit(df_pd[["SUMMESSAGECOUNT"]])
scaler_d.fit(df_dask[["SUMMESSAGECOUNT"]])

Dask scaler

scaler_d.mean_[0], scaler_d.var_[0]
output: (19.157653421114507, 47431.17794342375)

Sklearn Scaler

scaler_sk.mean_[0], scaler_sk.var_[0]
output: (19.157653421114507, 47431.17794342373)

I know the difference is negligible. But it is influencing my model training on prophet. Could you please suggest any way to make them identical without using compute().
