TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License

Train test split #28

Closed. johannes-kk closed this issue 4 years ago

johannes-kk commented 4 years ago

Either return a set of indices for splitting the given dataset (@gpestre, do we have functionality to index on that, like with NumPy?) or return two datasets, similarly to split_dataset. The latter sounds inefficient, but since our dataframes only store vectors of pointers anyway, it wouldn't really make copies: each result is just a separate (shuffled?) set of pointers to the same underlying rows.
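To make the two options concrete, here is a rough sketch. `Row`, `RowPtr`, `Frame`, `split_indices` and `split_frames` are hypothetical stand-ins, not the project's actual DataFrame API; the only assumption taken from the comment above is that a frame holds pointers to rows rather than the rows themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the project's types (assumption, not the real API):
// a row of values, and a frame that stores only shared pointers to rows.
using Row = std::vector<double>;
using RowPtr = std::shared_ptr<Row>;
using Frame = std::vector<RowPtr>;

// Option 1: return two index sets; the caller indexes into the original frame.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
split_indices(std::size_t n_rows, double train_ratio) {
    assert(train_ratio >= 0.0 && train_ratio <= 1.0);
    const auto n_train = static_cast<std::size_t>(n_rows * train_ratio);
    std::vector<std::size_t> train, test;
    for (std::size_t i = 0; i < n_rows; ++i) {
        if (i < n_train) train.push_back(i);
        else test.push_back(i);
    }
    return {train, test};
}

// Option 2: return two frames. Only the pointers are copied, so both
// results refer to the same underlying rows as the input frame.
std::pair<Frame, Frame> split_frames(const Frame& df, double train_ratio) {
    assert(train_ratio >= 0.0 && train_ratio <= 1.0);
    const auto n_train = static_cast<std::size_t>(df.size() * train_ratio);
    const auto cut = df.begin() + static_cast<std::ptrdiff_t>(n_train);
    return {Frame(df.begin(), cut), Frame(cut, df.end())};
}
```

In this sketch the cost of option 2 is one pointer copy per row, which is the same order of work as building the index vectors in option 1.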

johannes-kk commented 4 years ago

Do the latter: like split_dataset and sample, train_test_split takes a train/test ratio and returns two DataFrames.

wfseaton commented 4 years ago

Do we need to implement this for DataVector as well? I assume not.

johannes-kk commented 4 years ago

I don't think we do. DataFrames are vectors of DataVectors anyway, so if someone absolutely wants to train/test split a vector they can hack it with a single-column df.

johannes-kk commented 4 years ago

Added #71 as a blocker, since the train_test_split implementation offloads the job of shuffling the dataset to its sample call.
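For context, the dependency could look roughly like the sketch below. The signatures of `sample` and `train_test_split` are guesses for illustration, as are the `Row` and `Frame` types; the sketch assumes `sample` can return all rows in random order without replacement, which may not match what #71 delivers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the project's types (assumption, not the real API).
using Row = std::vector<double>;
using Frame = std::vector<std::shared_ptr<Row>>;

// Assumed behaviour of sample: draw n rows without replacement, in random order.
// Only pointers are shuffled; the rows themselves are never copied.
Frame sample(const Frame& df, std::size_t n, unsigned seed) {
    assert(n <= df.size());
    Frame shuffled = df;  // copies pointers only
    std::mt19937 rng(seed);
    std::shuffle(shuffled.begin(), shuffled.end(), rng);
    shuffled.resize(n);
    return shuffled;
}

// train_test_split leaves all shuffling to sample, then cuts at the ratio.
std::pair<Frame, Frame> train_test_split(const Frame& df, double train_ratio,
                                         unsigned seed) {
    assert(train_ratio >= 0.0 && train_ratio <= 1.0);
    const Frame shuffled = sample(df, df.size(), seed);
    const auto n_train = static_cast<std::size_t>(shuffled.size() * train_ratio);
    const auto cut = shuffled.begin() + static_cast<std::ptrdiff_t>(n_train);
    return {Frame(shuffled.begin(), cut), Frame(cut, shuffled.end())};
}
```

With this shape, train_test_split has no randomness of its own, so it cannot be finished or tested until sample behaves as assumed.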