When using train_test_split with shuffle=False and a Dask dataframe, I notice two issues: 1) the index is actually shuffled, and 2) the train/test sizes seem incorrect. The behavior doesn't match sklearn, or dask_ml when it is passed a raw pandas DataFrame.
Minimal Complete Verifiable Example:
Setup
import pandas as pd
import numpy as np
import dask.dataframe as dd
from sklearn.model_selection import train_test_split as sk_train_test_split
from dask_ml.model_selection import train_test_split as dd_train_test_split
df = pd.DataFrame(np.random.rand(10, 3), columns=["y", "x1", "x2"])
ddf = dd.from_pandas(df, npartitions=5)
With sklearn.model_selection, order is maintained (i.e. no shuffle)
y = df["y"]
X = df[["x1", "x2"]]
X_train, X_test, y_train, y_test = sk_train_test_split(X, y, test_size=0.5, shuffle=False)
y_train, y_test
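To make the two symptoms easy to check for any of the splits, here is a minimal sketch of a helper. The name `check_ordered_split` is hypothetical and not part of sklearn or dask_ml. It assumes the original index is sorted ascending (as with the default RangeIndex here) and that a float test_size is rounded up to a whole number of rows, which is how I understand sklearn to behave.

```python
import math


def check_ordered_split(train_index, test_index, n_rows, test_size):
    """Report whether an unshuffled split kept row order and expected sizes.

    Hypothetical helper for illustration only.
    Assumes the original index was sorted ascending and that the test
    set size is ceil(test_size * n_rows).
    """
    train = list(train_index)
    test = list(test_index)

    # Expected sizes under the rounding assumption stated above.
    expected_test = math.ceil(test_size * n_rows)
    expected_train = n_rows - expected_test

    # Without shuffling, train followed by test should still be in sorted order.
    combined = train + test
    return {
        "order_kept": combined == sorted(combined),
        "sizes_ok": (len(train), len(test)) == (expected_train, expected_test),
    }
```

For the sklearn split above this would be called as `check_ordered_split(y_train.index, y_test.index, len(df), 0.5)`. For a Dask result, the index would first need to be materialized, e.g. `y_train.compute().index`.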
With dask_ml.model_selection using a pandas DataFrame, order is maintained (i.e. no shuffle)
With dask_ml.model_selection using a Dask DataFrame, order is NOT maintained and the train/test size is incorrect.
Environment: