TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License

Implement random seed for RandomForest and DecisionTree. #65

Closed gpestre closed 4 years ago

gpestre commented 4 years ago

For #57.

I created a SeedGenerator in datasets.cpp to generate a repeatable stream of seeds (but currently it just returns -1).
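For illustration, here is a minimal sketch of what a repeatable seed stream could look like. It is not the SeedGenerator from datasets.cpp: the constructor argument and the `next()` and `take()` methods are assumed names, and the choice of `std::mt19937` as the underlying engine is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch, not the project's SeedGenerator: the interface
// (constructor argument, next(), take()) is assumed for illustration.
class SeedGenerator {
public:
    // A fixed meta_seed makes the whole stream of seeds repeatable.
    explicit SeedGenerator(std::uint32_t meta_seed) : engine_(meta_seed) {}

    // Return the next seed in the stream.
    std::uint32_t next() { return static_cast<std::uint32_t>(engine_()); }

    // Convenience: draw n seeds up front, e.g. one per tree in a forest.
    std::vector<std::uint32_t> take(std::size_t n) {
        std::vector<std::uint32_t> seeds;
        seeds.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            seeds.push_back(next());
        }
        return seeds;
    }

private:
    // The raw output sequence of mt19937 for a given seed is fixed by the
    // C++ standard, so the stream is the same on every platform.
    std::mt19937 engine_;
};
```

Two generators built with the same meta seed, say `SeedGenerator a(42)` and `SeedGenerator b(42)`, hand out identical seeds in the same order, which is the property the trees need in order to be reproducible.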

For DecisionTree:

For RandomForest:

johannes-kk commented 4 years ago

So far we've implemented FindBestSplit in DecisionTree with an mtry parameter. Vanilla decision trees always use mtry equal to the number of features in the dataset, which is why they tend to be so highly correlated. Trees in Random Forests use a smaller mtry, so each split evaluates only a random subset of the features.

For an actual vanilla decision tree, having the mtry parameter and random subset functionality is superfluous. I didn't point this out earlier as I figured we could implement DecisionTree with the mtry subset included, so that our Random Forest is simply a collection of DecisionTrees with mtry smaller than the number of features.
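To make the mtry idea concrete, here is a rough sketch of how the candidate features for one split could be drawn. `sample_features` is a hypothetical helper, not a function from this repo, and shuffling then truncating is just one assumed way to sample without replacement.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: pick the feature indices that one split may consider.
// With mtry == n_features this returns every feature (the vanilla tree case);
// with mtry < n_features it returns a random subset (the Random Forest case).
std::vector<int> sample_features(int n_features, int mtry, std::mt19937& rng) {
    assert(mtry >= 1 && mtry <= n_features);
    std::vector<int> idx(n_features);
    std::iota(idx.begin(), idx.end(), 0);        // 0, 1, ..., n_features - 1
    std::shuffle(idx.begin(), idx.end(), rng);   // random order, no repeats
    idx.resize(mtry);                            // keep the first mtry indices
    std::sort(idx.begin(), idx.end());           // stable order for the split loop
    return idx;
}
```

Passing the engine in by reference means the caller controls the seed, so a tree built from the same seed draws the same subsets in the same order. The exact subset produced by `std::shuffle` can differ between standard library implementations, so results are repeatable on one toolchain but not necessarily across toolchains.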

From #57 I see that's also how you implemented RF v1, so it looks like we're thinking along the same lines. Long wall of text to point out something obvious, but figured it's good to voice that design choice explicitly.

gpestre commented 4 years ago

I wasn't sure myself if vanilla decision trees ever used a subset of features or not, so thanks for the clarification. I followed your original implementation of best_split from the helpers file, which is why the mtry hyperparameter ended up in the DecisionTree class.

But in any case, I agree that it seems like the best design approach, especially for parallelization, because it allows the RandomForest algo to dispatch the training of each tree instance to the DecisionTree class.

Leads to a very simple RandomForest.fit(): https://github.com/johannes-kk/cs205_final_project/blob/915d1f125e687e8be9308f889d69e1cddb4b8a12/src/random_forest.cpp#L114:L131
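As a rough picture of that dispatch pattern (this is not the code at the link above), here is a stripped-down sketch. The `DecisionTree` here is a stand-in that only records its settings, all member names and signatures are assumptions, and bootstrap sampling of the rows is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Stand-in for the real DecisionTree: it only records its settings so the
// dispatch pattern is visible. The real class would grow a tree in fit().
struct DecisionTree {
    int mtry;
    std::uint32_t seed;
    bool fitted = false;

    DecisionTree(int mtry_, std::uint32_t seed_) : mtry(mtry_), seed(seed_) {}

    void fit(const std::vector<std::vector<double>>& X, const std::vector<int>& y) {
        assert(X.size() == y.size());
        fitted = true;  // placeholder for the actual training
    }
};

// Hypothetical forest: a collection of DecisionTrees sharing one mtry.
struct RandomForest {
    int n_trees;
    int mtry;
    std::uint32_t meta_seed;
    std::vector<DecisionTree> trees;

    RandomForest(int n_trees_, int mtry_, std::uint32_t meta_seed_)
        : n_trees(n_trees_), mtry(mtry_), meta_seed(meta_seed_) {}

    void fit(const std::vector<std::vector<double>>& X, const std::vector<int>& y) {
        // Draw one seed per tree up front so the result does not depend on
        // the order in which the trees are later trained.
        std::mt19937 seed_stream(meta_seed);
        trees.clear();
        trees.reserve(static_cast<std::size_t>(n_trees));
        for (int i = 0; i < n_trees; ++i) {
            trees.emplace_back(mtry, static_cast<std::uint32_t>(seed_stream()));
        }
        // Each tree trains independently; this loop is the natural place
        // to parallelize.
        for (DecisionTree& tree : trees) {
            tree.fit(X, y);
        }
    }
};
```

Because every tree gets its seed before any training starts, the training loop has no shared random state, which is what would make it safe to hand the iterations to separate threads.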