RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

What method is used to compute gini split index on the Classification and regression tree? #61

Open nizariyah opened 4 years ago

nizariyah commented 4 years ago

Is it brute force or midpoint?

andrewdalpino commented 4 years ago

Hi @nizariyah, thanks for the question.

Classification Tree minimizes Gini impurity and Extra Tree Classifier minimizes entropy at the leaf nodes of the tree. For regression, both Regression Tree and Extra Tree Regressor minimize variance.

That said, for continuous features we use a quantile method similar to the technique described in the CLOUDS paper. The search space for continuous features with this method is typically about a quarter the size of brute force, with no loss in overall accuracy. For categorical features we use brute force.
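To make the size difference concrete, here is a minimal sketch (in Python, not the library's PHP source) of quantile-based split candidates for a continuous feature: only the quantile cut points are tested as thresholds, rather than every distinct value as brute force would.

```python
# Illustrative sketch: quantile split candidates vs. brute force.
# Function and variable names are hypothetical, not RubixML's API.
import statistics

def quantile_candidates(values, q=4):
    """Return q - 1 quantile cut points to test as split thresholds."""
    # statistics.quantiles divides the data into q intervals,
    # yielding q - 1 cut points.
    return statistics.quantiles(sorted(values), n=q)

feature = [2.0, 3.5, 1.0, 7.2, 4.4, 5.1, 9.0, 0.5]
candidates = quantile_candidates(feature, q=4)

# Brute force would try a threshold between every pair of adjacent
# distinct values: len(set(feature)) - 1 candidates in total.
print(len(candidates), "candidates vs", len(set(feature)) - 1, "for brute force")
```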

If the best split fails to reduce the impurity by a user-defined amount (minPurityIncrease), the node is immediately pruned and turned into a leaf to avoid overfitting.
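A hedged sketch of what that pruning check looks like, again in Python with hypothetical names (the real PHP implementation differs): compute the Gini impurity of the parent and the weighted impurity of the children, and only keep the split if the decrease clears the threshold.

```python
# Illustrative minPurityIncrease check; names are hypothetical.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_split(parent, left, right, min_purity_increase=0.01):
    """Keep the split only if it reduces impurity by the threshold."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted >= min_purity_increase

parent = ['a', 'a', 'b', 'b']
print(should_split(parent, ['a', 'a'], ['b', 'b']))  # perfect split: True
print(should_split(parent, ['a', 'b'], ['a', 'b']))  # no improvement: False
```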

Thanks, and I'll be happy to answer any follow-ups.

andrewdalpino commented 4 years ago

Hi @nizariyah

The CART implementation has been tuned and optimized in the latest commit https://github.com/RubixML/RubixML/commit/89f6991794c9ee5e7a2f358c1cc52167450684a8

Instead of using a 0.25 ratio of quantiles to samples at the split node, we are now using the square root of the number of samples. Even with this more aggressive search heuristic, accuracy has not changed in any of our tests, including the House Price Predictor example project.

Compared to the median strategy that I believe you are referring to, the quantile method effectively falls back to the median when the number of samples at the split node is low, while choosing more precise (and efficient) splits when a lot of data is available to the splitting algorithm.
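That adaptive behavior can be sketched as follows (Python, hypothetical names): the candidate count grows with the square root of the sample count, so a tiny node tests only its median while a large node gets finer-grained thresholds.

```python
# Sketch of the sqrt-of-sample-count heuristic described above.
# Names are illustrative, not the library's actual API.
import math
import statistics

def split_candidates(values):
    """Quantile cut points, with the quantile count scaling as sqrt(n)."""
    k = max(2, round(math.sqrt(len(values))))  # number of quantile intervals
    return statistics.quantiles(sorted(values), n=k)

small = [1.0, 2.0, 3.0, 10.0]   # sqrt(4) = 2  -> a single cut (the median)
large = list(range(100))        # sqrt(100) = 10 -> 9 cut points
print(len(split_candidates(small)), len(split_candidates(large)))
```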

Hope this helps, and thanks again for this question. I took some time to optimize the CART implementation as a result and we're seeing almost an order of magnitude speedup for large datasets.

andrewdalpino commented 4 years ago

Performance comparison of PHP library Random Forest implementations on the Dota 2 dataset: 10 trees, 0.5 subsample ratio, categorical features.

[Chart: random-forest-speed]

[Chart: random-forest-accuracy]