So far we've implemented `FindBestSplit` in `DecisionTree` with an `mtry` parameter. Vanilla decision trees always use `mtry` equal to the number of features in the dataset, which is why they tend to be so highly correlated. Trees in Random Forests use a smaller `mtry`, so each split only evaluates a random subset of the features.
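To make the role of `mtry` concrete, here's a minimal C++ sketch of the subset draw a split routine could do. The helper name `featuresToEvaluate` and the `mtry == -1` convention are just for illustration, not necessarily what `FindBestSplit` actually does in the repo:

```cpp
// Illustrative sketch: pick which feature indices to evaluate at one split.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// mtry == -1 is taken to mean "use all features" (vanilla decision tree).
std::vector<int> featuresToEvaluate(int num_features, int mtry, std::mt19937& rng) {
    std::vector<int> idx(num_features);
    std::iota(idx.begin(), idx.end(), 0);       // 0, 1, ..., num_features-1
    if (mtry == -1 || mtry >= num_features) {
        return idx;                             // vanilla tree: evaluate every feature
    }
    std::shuffle(idx.begin(), idx.end(), rng);  // Random Forest tree: random subset
    idx.resize(mtry);
    return idx;
}
```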
For an actual vanilla decision tree, having the `mtry` parameter and the random-subset functionality is superfluous. I didn't point this out earlier because I figured we could implement `DecisionTree` with the `mtry` subset included, so that our Random Forest is simply a collection of `DecisionTree`s with `mtry` smaller than the number of features. From #57 I see that's also how you implemented RF v1, so it looks like we're thinking along the same lines. Long wall of text to point out something obvious, but I figured it's good to voice that design choice explicitly.
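Roughly, the design choice amounts to something like this (constructor signature and member names are made up for illustration; the repo's actual `DecisionTree` will differ in detail):

```cpp
// Illustrative only: one DecisionTree class serves both cases, depending on mtry.
#include <random>

class DecisionTree {
public:
    // mtry = -1 (the default) means "consider every feature", i.e. a vanilla tree;
    // the forest constructs its trees with mtry smaller than the feature count.
    explicit DecisionTree(int mtry = -1, int seed = 0) : mtry_(mtry), rng_(seed) {}

private:
    // FindBestSplit would consult mtry_ to decide whether to scan all features
    // or only a random subset (e.g. via a helper like featuresToEvaluate above).
    int mtry_;
    std::mt19937 rng_;  // per-tree RNG so the subset draws are reproducible
};
```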
I wasn't sure myself if vanilla decision trees ever used a subset of features or not, so thanks for the clarification. I followed your original implementation of `best_split` from the helpers file, which is why the `mtry` hyperparameter ended up in the `DecisionTree` class.
But in any case, I agree that it seems like the best design approach, especially for parallelization, because it allows the `RandomForest` algo to dispatch the training of each tree instance to the `DecisionTree` class. It leads to a very simple `RandomForest.fit()`:
https://github.com/johannes-kk/cs205_final_project/blob/915d1f125e687e8be9308f889d69e1cddb4b8a12/src/random_forest.cpp#L114:L131
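For anyone reading this without the repo open, here's a hedged sketch of the shape that loop can take; `DataFrame`, `bootstrapSample`, and all the member names below are placeholders, and the linked lines are the real reference:

```cpp
// Hedged sketch only: the linked fit() in random_forest.cpp is authoritative.
#include <vector>

struct DataFrame {};  // stand-in for the project's dataset type

DataFrame bootstrapSample(const DataFrame& data, int seed) {
    // A real version would resample rows with replacement using `seed`.
    return data;
}

class DecisionTree {
public:
    explicit DecisionTree(int mtry) : mtry_(mtry) {}
    void fit(const DataFrame& data) { /* grow the tree via FindBestSplit */ }
private:
    int mtry_;  // features considered per split
};

class RandomForest {
public:
    RandomForest(int num_trees, int mtry) : num_trees_(num_trees), mtry_(mtry) {}

    // The forest only bootstraps and dispatches; DecisionTree does the real work.
    void fit(const DataFrame& train) {
        trees_.clear();
        trees_.reserve(num_trees_);
        for (int t = 0; t < num_trees_; ++t) {
            DataFrame sample = bootstrapSample(train, /*seed=*/t);
            DecisionTree tree(mtry_);  // mtry smaller than the number of features
            tree.fit(sample);
            trees_.push_back(tree);
        }
        // Each iteration is independent of the others, so parallelizing means
        // handing one tree to each worker/thread.
    }

private:
    int num_trees_;
    int mtry_;
    std::vector<DecisionTree> trees_;
};
```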
For #57.
I created a `SeedGenerator` in `datasets.cpp` to generate a repeatable stream of seeds (but currently it just returns -1).

For `DecisionTree`: the seed only matters when a random feature subset is actually drawn (`mtry != -1`).
For `RandomForest`: each tree gets its own seed from the stream along with the specified `mtry`.
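Since the current `SeedGenerator` is still a stub that returns -1, here's one hedged sketch of how a repeatable seed stream could eventually work; only the class name comes from the comment above, the interface is a guess:

```cpp
// Sketch of a repeatable seed stream; everything here is an assumption except
// that the current version is described as always returning -1.
#include <limits>
#include <random>

class SeedGenerator {
public:
    // Seeding the engine with a fixed base seed makes the whole stream of
    // per-tree seeds repeatable across runs.
    explicit SeedGenerator(int base_seed = 0)
        : engine_(static_cast<unsigned>(base_seed)) {}

    // Today's placeholder would just `return -1;`; a deterministic engine
    // lets every call hand out the next seed in the stream instead.
    int new_seed() {
        std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
        return dist(engine_);
    }

private:
    std::mt19937 engine_;
};
```

Each `DecisionTree` could then be constructed with `seeds.new_seed()`, so re-running the forest with the same base seed reproduces the same trees.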