TheDigitalFrontier / parallel-decision-trees

Semester project in CS205 Computing Foundations for Computational Science at Harvard School of Engineering and Applied Sciences, spring 2020.
MIT License
3 stars 1 forks source link

Move mtry defaults from findBestSplit to DT and RF #75

Closed johannes-kk closed 4 years ago

johannes-kk commented 4 years ago

Currently mtry is passed as a parameter to the DecisionTree or RandomForest constructor, then passed on to findBestSplit. The default should differ for DT and RF: mtry equal to the number of predictors and mtry equal to the square root of that number, respectively.

As such, move setting of mtry default to the parent tree object, and pass that on.

wfseaton commented 4 years ago

If we no longer use mtry==0 to represent sqrt(), can mtry ever be 0?

johannes-kk commented 4 years ago

Hmm... this boils down to another design choice, I think. If we set mtry at the tree level, then a way to implement this in findBestSplit() is that when mtry != num_features then shuffle the column indices, otherwise do not. This assumes that any vanilla DecisionTree will have mtry == num_features (which really it should, otherwise it's not a vanilla CART), while it can still be used as a subtree in a RandomForest because it allows for a smaller mtry (at which point it also shuffles the indices and draws a random subset mtry).

johannes-kk commented 4 years ago

Primarily, what I'd like is that when we pass mtry = 1 to the DecisionTree constructor (i.e. the default value), it sets DecisionTree.mtry = num_features, whereas for RandomForest the default value mtry = -1 means its constructor sets RandomForest.mtry = sqrt(num_features) and it applies shuffling in findBestSplit().