TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License

Sample without replacement & Finalise train/test split #72

Closed johannes-kk closed 4 years ago

johannes-kk commented 4 years ago

I don't fully understand the sampling function, but the updated train_test_split looks good to me.

The updated sample function has two branches. If replace == true, it uses the original code: it repeatedly draws from the vector of row indices uniformly at random until the bootstrapped sample has nrow observations. If replace == false, it instead shuffles the vector of original row indices and takes the first nrow of them, so the result can have at most as many observations as the original dataframe.
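For anyone else who finds the function hard to follow, here is a minimal Python sketch of the two branches as I describe them above. It is not the project's actual code: the signature, the seed parameter, and the use of a plain list in place of the dataframe are all assumptions made for illustration.

```python
import random


def sample(rows, nrow, replace, seed=None):
    """Illustrative sketch only; `rows` is a plain list standing in for the dataframe."""
    rng = random.Random(seed)          # seed is a hypothetical parameter, for reproducibility
    indices = list(range(len(rows)))   # vector of original row indices

    if replace:
        # With replacement: draw indices uniformly until we have nrow of them.
        # (Requires a non-empty `rows` when nrow > 0.)
        picked = []
        while len(picked) < nrow:
            picked.append(rng.choice(indices))
    else:
        # Without replacement: shuffle the indices and take the first nrow.
        # Slicing silently caps the result at len(rows).
        rng.shuffle(indices)
        picked = indices[:nrow]

    return [rows[i] for i in picked]
```

With replace false and nrow equal to the number of rows, the result is a permutation of the input. Asking for more rows than exist returns only as many as the input has.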

train_test_split then just uses sample to draw a sample with the same number of rows as the passed dataframe, without replacement, which means it effectively just shuffles the dataframe.
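A rough Python sketch of that idea, again not the project's code. The shuffle step follows the description above. The train_fraction parameter and the way the shuffled rows are cut into two parts are my own assumptions, since the comment above does not spell out that step.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=None):
    """Illustrative sketch only; `rows` is a plain list standing in for the dataframe."""
    rng = random.Random(seed)  # seed is a hypothetical parameter

    # Sampling len(rows) items without replacement yields a permutation,
    # i.e. the input is simply shuffled.
    shuffled = rng.sample(rows, len(rows))

    # Assumption: the first train_fraction of the shuffled rows become the
    # training set and the remainder the test set.
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the shuffle is a full permutation, every row ends up in exactly one of the two sets.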