dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Documentation Issue with train_test_split and blockwise #999

Open christhorn2 opened 3 months ago

christhorn2 commented 3 months ago

Describe the issue:

The API documentation for dask-ml's train_test_split states that blockwise=False is supported for arrays: "For Dask Arrays, set blockwise=False to shuffle data between blocks as well." https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html#dask_ml.model_selection.train_test_split
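To make the quoted sentence concrete, here is a minimal pure-Python sketch of the difference between shuffling within blocks and shuffling between blocks. It is only an illustration of the concept, not dask-ml's implementation; the function names `blockwise_shuffle` and `global_shuffle` are made up for this example, and plain lists stand in for Dask array chunks.

```python
import random


def blockwise_shuffle(blocks, seed=0):
    """Shuffle elements inside each block; no element leaves its block."""
    rng = random.Random(seed)
    out = []
    for block in blocks:
        shuffled = list(block)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out


def global_shuffle(blocks, seed=0):
    """Shuffle all elements together, then re-cut into the original block sizes."""
    rng = random.Random(seed)
    flat = [item for block in blocks for item in block]
    rng.shuffle(flat)
    out, start = [], 0
    for block in blocks:
        out.append(flat[start:start + len(block)])
        start += len(block)
    return out


# Two blocks of four elements, mirroring da.arange(8, chunks=4).
blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]
within = blockwise_shuffle(blocks)   # analogous to blockwise=True
across = global_shuffle(blocks)      # analogous to what blockwise=False is documented to do
```

With the blockwise variant, the first block always contains some ordering of 0 to 3; with the global variant, any element may end up in any block.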

I think the code also intends to support blockwise=False for arrays, and it delegates the work to ShuffleSplit: https://github.com/dask/dask-ml/blob/567cfd7837c7616fd352e0efbcfcee42f351199c/dask_ml/model_selection/_split.py#L490

However, ShuffleSplit does not support blockwise=False:

https://github.com/dask/dask-ml/blob/567cfd7837c7616fd352e0efbcfcee42f351199c/dask_ml/model_selection/_split.py#L194
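The mismatch described above can be sketched in isolation: a top-level function accepts `blockwise` and forwards it, while the splitter it delegates to rejects `blockwise=False`. This is a simplified illustration, not the dask-ml source; `SketchShuffleSplit` and `sketch_train_test_split` are hypothetical stand-ins, and where exactly the real code raises is an assumption.

```python
class SketchShuffleSplit:
    """Hypothetical stand-in for a splitter that only supports blockwise=True."""

    def __init__(self, blockwise=True):
        self.blockwise = blockwise

    def split(self, data):
        if not self.blockwise:
            # Same message as the error reported in this issue.
            raise NotImplementedError(
                "ShuffleSplit with blockwise=False has not been implemented yet."
            )
        return data  # placeholder: a real splitter would yield train/test indices


def sketch_train_test_split(data, blockwise=True):
    """Hypothetical wrapper that forwards `blockwise` without checking support."""
    splitter = SketchShuffleSplit(blockwise=blockwise)
    return splitter.split(data)


# The wrapper's signature suggests blockwise=False works, but the call fails.
try:
    sketch_train_test_split([0, 1, 2, 3], blockwise=False)
    error_message = None
except NotImplementedError as exc:
    error_message = str(exc)
```

In a layout like this, either the documentation of the wrapper or the splitter's implementation has to change for the two to agree.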

Minimal Complete Verifiable Example:

from dask_ml.model_selection import train_test_split
import dask.array as da

x = da.arange(8, chunks=4)
train_test_split(x, blockwise=False)  # raises the error below
....
NotImplementedError: ShuffleSplit with blockwise=False has not been implemented yet.

Environment:

narnia24 commented 1 month ago

Hey @christhorn2, can I work on this issue?