adam2392 / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Adding Oblique Trees to the Cythonized Tree Module #2

Open adam2392 opened 3 years ago

adam2392 commented 3 years ago

Describe the workflow you want to enable

We are proposing to add a cythonized module that allows for building oblique trees.

Oblique trees, originally proposed by Breiman (2001) [1], would enable a more flexible class of decision trees. Breiman called one version of these Forest-RC, in contrast to what most people now call random forest, which he called Forest-RI. He noted:

"Forest-RC does exceptionally well on the synthetic data sets. Overall, it compares more favorably to Adaboost than Forest-RI."

In other words, in the paper that introduced random forests over 20 years ago, now with over 75,000 citations, Breiman presented two algorithms and argued that the one that splits on sparse linear combinations of features worked better than the axis-aligned one most people have used since. More recently, several authors have corroborated these original results [2-5]. Oblique trees in general give trees a wide class of flexibility to take advantage of structure in the data. Beyond the empirical evidence, the sparsity constraint here is motivated by the Johnson-Lindenstrauss lemma.
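To make the split rule concrete, here is a minimal NumPy sketch of the idea (not the proposed Cython implementation; the helper name and `density` parameter are illustrative): each candidate split thresholds a sparse random linear combination of features rather than a single feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_oblique_candidates(X, n_candidates=5, density=0.1):
    """Project samples onto sparse random linear combinations of features.

    Each column of A has roughly density * n_features nonzero entries drawn
    from {-1, +1}, mirroring the sparse projections used by Forest-RC /
    SPORF-style trees. (Illustrative sketch, not the proposed API.)
    """
    n_features = X.shape[1]
    A = np.zeros((n_features, n_candidates))
    for j in range(n_candidates):
        nnz = max(1, int(density * n_features))
        idx = rng.choice(n_features, size=nnz, replace=False)
        A[idx, j] = rng.choice([-1.0, 1.0], size=nnz)
    # Candidate "features"; the tree then searches for the best threshold
    # over these, exactly as an axis-aligned tree does over columns of X.
    return X @ A

X = rng.normal(size=(100, 20))
Z = sparse_oblique_candidates(X)  # shape (100, 5)
```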

Describe your proposed solution

Our implementation of sparse oblique random forests (SPORF) was ported into a scikit-learn fork (https://github.com/adam2392/scikit-learn). The SPORF Cython implementation builds cleanly on top of the existing Cython code for regular decision trees, and thus does not affect the runtime or performance of existing random forests at all. Moreover, it improves upon RF on a number of cc_18 benchmarks (see figures below).

The .pxd files can also be included in each scikit-learn release, so that other sklearn-compatible forest modules can build on our implementation of the oblique tree.
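For a sense of the intended interface, here is a usage sketch. Only `RandomForestClassifier` below is a released estimator; the oblique estimator's name and signature are hypothetical stand-ins for the fork's class, the point being that it is a drop-in replacement:

```python
# Sketch of the intended drop-in usage. RandomForestClassifier is the released
# estimator; the oblique variant on the fork (name hypothetical, e.g.
# ObliqueRandomForestClassifier) would be swapped in with the same interface.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```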

Relevant Benchmarks

To ease scikit-learn developer concerns about incorporating such a large code change, we provide the following benchmarks and tests of the Cythonized oblique forest (cysporf). These benchmarks were run on the OpenML CC-18 suite. max_features was set to n_features, so runtime was very high for both RF and oblique forests on e.g. Fashion-MNIST. (A sketch of the benchmarking protocol appears after the figures below.)

  1. benchmark tests showing runtime performance on a sparse parity problem relevant to random forests (provided in a Jupyter notebook)
  2. benchmark tests on cc_18 vs. random forests (see the two figures below)

[Two figures: cc_18 benchmark comparison of oblique forests vs. RF (per-dataset delta Cohen's kappa)]

  3. overall runtime (training times on cc_18) vs. random forests

[Figure: training times on cc_18, oblique forest vs. RF]
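For reference, a rough sketch of the per-dataset benchmarking protocol (credit-g here is one CC-18 member standing in for the full suite; the real runs iterate over all CC-18 tasks):

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# One CC-18 dataset as an example; the full benchmark loops over every task.
X, y = fetch_openml("credit-g", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features=None corresponds to max_features = n_features, as in the runs above.
clf = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
start = time.perf_counter()
clf.fit(X_train, y_train)
train_time = time.perf_counter() - start

kappa = cohen_kappa_score(y_test, clf.predict(X_test))
print(f"train time: {train_time:.1f}s, Cohen's kappa: {kappa:.3f}")
# The oblique forest would be benchmarked identically and compared per dataset.
```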

A paired Wilcoxon signed-rank test was run comparing SPORF vs. RF, giving a p-value of 2.52e-20. The alternative hypothesis is that SPORF's Cohen's kappa is greater than RF's Cohen's kappa.
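The test itself can be reproduced with scipy.stats.wilcoxon; the kappa arrays below are placeholders standing in for the per-dataset CC-18 scores:

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-dataset Cohen's kappa scores; the real arrays come from the
# CC-18 runs (one paired score per task; CC-18 comprises 72 tasks).
rng = np.random.default_rng(0)
rf_kappa = rng.uniform(0.5, 0.9, size=72)
sporf_kappa = rf_kappa + rng.normal(0.02, 0.01, size=72)  # fake improvement for the demo

# One-sided paired Wilcoxon signed-rank test:
# H1: SPORF kappa > RF kappa on matched datasets.
stat, p = wilcoxon(sporf_kappa, rf_kappa, alternative="greater")
print(f"p-value: {p:.3g}")
```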

Additional context

[1] Breiman, L. Random Forests. Machine Learning 45, 5-32 (2001). https://doi.org/10.1023/A:1010933404324
[2] Tomita, T. M., Browne, J., Shen, C., Chung, J., Patsolic, J. L., Falk, B., Yim, J., Priebe, C. E., Burns, R., Maggioni, M., and Vogelstein, J. T. Sparse Projection Oblique Randomer Forests. Journal of Machine Learning Research, 2020.
[3] Tomita, T. M., Maggioni, M., and Vogelstein, J. T. ROFLMAO: Robust Oblique Forests with Linear MAtrix Operations. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 498-506. Society for Industrial and Applied Mathematics, 2017.
[4] Rainforth, T. and Wood, F. Canonical Correlation Forests. 2015.
[5] Perry, R., Li, A., Huynh, C., Tomita, T. M., Mehta, R., Arroyo, J., Patsolic, J., Falk, B., and Vogelstein, J. Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks. 2019.

jovo commented 3 years ago

great start.

decouple from our work to the extent possible, and highlight Breiman's original comment about how linear combination stuff worked better.

then, cite ROFLMAO, CCF, SPORF, and other recent oblique tree stuff. be explicit in the citations.

do not mention the possibility of keeping it in our repo. i do not think we need to mention who did what.

include the benchmark figures when they are complete.

jovo commented 3 years ago
  1. "Moreover in that paper, he did note that linear combinations performed better then random independent features."

i wanted the quote where he said L-RC was better than F-RC.

  1. "The implementation of SPORF was ported into a scikit-learn fork (https://github.com/adam2392/scikit-learn). The SPORF Cython".

I don't think i'd mention SPORF here at all. just say we implemented sparse oblique forests.

  1. the 2nd figure is not explained, what is delta-kappa, what does >0 mean, what about <0. also, do we have improved results yet?
adam2392 commented 3 years ago
  1. "Moreover in that paper, he did note that linear combinations performed better then random independent features."

i wanted the quote where he said L-RC was better than F-RC.

Hmm I don't think the original RF paper explicitly mentions a linear-RC. Do you mean this one?

"On the larger data sets, F=8 gives better results. Forest-RC does exceptionally well on the synthetic data sets. Overall, it compares more favorably to Adaboost than Forest-RI." Where Forest-RI is our traditional random forests, and Forest-RC is a forest with linear combinations.

I don't think i'd mention SPORF here at all. just say we implemented sparse oblique forests.

+1

  1. the 2nd figure is not explained, what is delta-kappa, what does >0 mean, what about <0. also, do we have improved results yet?

Working on those right now. We've fixed the speed and performance issues, so the Cython oblique forests now run on all cc_18 examples; previously they could not due to runtime.

We still need to run side-by-side comparisons against rerf to really understand why they differ and on which datasets. Right now the code looks semantically identical to us.

jovo commented 3 years ago

awesome, yes, that quote, excited for new results. thanks

jovo commented 3 years ago

please compute a statistical test illustrating that SPORF works better than RF.