Closed by johannes-kk 4 years ago
Do the latter: similar to `split_dataset` and `sample`, add a function that splits the dataset by a given train/test ratio and returns two DataFrames.
Do we need to implement this for DataVector as well? I assume not.
I don't think we do. `DataFrames` are vectors of `DataVectors` anyway, so if someone absolutely wants to train/test split a vector, they can hack it with a single-column df.
Added #71 as a blocker, since the `train_test_split` implementation offloads the job of shuffling the dataset to its `sample` call.
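To illustrate the dependency, here is a minimal Python sketch of a split that leaves all shuffling to a sampling helper. It is not the project's code: the `sample` below is a hypothetical stand-in for the library's function (assumed here to draw rows without replacement), and the `test_ratio` and `seed` parameters are assumptions made for the example.

```python
import random


def sample(rows, n, seed=None):
    """Hypothetical stand-in for the library's sample():
    draw n rows without replacement, in random order."""
    return random.Random(seed).sample(rows, n)


def train_test_split(rows, test_ratio=0.25, seed=None):
    """Split rows into (train, test) by ratio.

    Drawing every row without replacement is equivalent to a shuffle,
    so this function only has to cut the shuffled sequence in two.
    """
    shuffled = sample(rows, len(rows), seed=seed)
    n_test = int(len(rows) * test_ratio)  # truncates toward zero
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(10)), test_ratio=0.2, seed=0)
```

With this shape, any bug or missing feature in `sample` directly affects the split, which is why it makes sense to treat the sampling work as a blocker.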
Either return a set of indices for splitting the given dataset (@gpestre, do we have functionality to index on that, like with NumPy?) or return two datasets, similarly to `split_dataset`. The latter sounds inefficient, but since our dataframes only store vectors of pointers anyway, it wouldn't really make copies: each result is just a separate (shuffled?) set of pointers to the same underlying rows.
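The two options can be sketched side by side. This is a rough Python illustration under stated assumptions, not the library's API: `split_indices` and `take` are hypothetical names, and Python lists of object references stand in for the dataframe's vectors of pointers.

```python
import random


def split_indices(n_rows, test_ratio=0.25, seed=None):
    """Option 1: return shuffled (train, test) index lists only."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_ratio)  # truncates toward zero
    return indices[n_test:], indices[:n_test]


def take(rows, indices):
    """Option 2 builds on option 1: gather references to the chosen rows.

    Only a new list of references is created; the row objects
    themselves are shared with the original dataset, not copied.
    """
    return [rows[i] for i in indices]


rows = [{"x": i} for i in range(8)]
train_idx, test_idx = split_indices(len(rows), test_ratio=0.25, seed=1)
train, test = take(rows, train_idx), take(rows, test_idx)

# A change made through the original dataset is visible through the split,
# because both hold references to the same underlying row.
rows[test_idx[0]]["x"] = -1
```

In this sketch the second option costs one extra list of references per split, which matches the intuition that returning two datasets would not duplicate row data.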