adam2392 / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Adding Oblique Trees to the Cythonized Tree Module #1

Closed adam2392 closed 3 years ago

adam2392 commented 3 years ago

This is a draft Issue to file to the actual repo. cc: @rflperry @ChesterHuynh @jovo

Describe the workflow you want to enable

Oblique trees originally proposed by, Breiman 2001 1, would enable a more flexible class of decision trees. These are formed by linear combinations (possibly sparse) of the input data, which has been shown to perform better on a wide range of simulated and benchmark datasets in 2. Oblique trees would enable sparse projection oblique random forests (SPORF). Besides empirical evidence, the sparse constraint here borrows from the Johnson Lindenstrauss theorem.

Describe your proposed solution

The implementation of SPORF was ported into a scikit-learn fork (https://github.com/adam2392/scikit-learn), which was started by Parth Vora, and then finished by Chester Huynh, Adam Li and members of Joshua Vogelstein's lab. The SPORF Cython implementation currently builds cleanly on top of the existing cython code for the regular Decision Trees and thus does not affect the runtime, or performance on Random Forests at all.

The .pxd files can also be included in each scikit-learn release, so that other sklearn-compatible forest modules can build upon our implementation of the Oblique tree.

Describe alternatives you've considered, if relevant

As with all proposals to scikit-learn, an alternative could be to just keep things in the rerf package. However, this would limit its extensibility and usability in the community. The idea of oblique forests has been proposed and been around since 2001, so it has in some sense stood the test of time as being a potentially useful idea in machine learning to make decision trees more flexible.

Relevant Benchmarks

To ease scikit-learn developer concerns on incorporating such a large piece of code change, we can provide the following benchmarks and tests on Cythonized SPORF:

  1. benchmark tests showing the runtime performance on a sparse parity problem (see 2) relevant to Random Forests
  2. benchmark tests on cc_18 vs Random Forests

Additional context

The original implementation was a pure C++ implementation that utilizes many of the lower-level features of C++ (https://sporf.neurodata.io). Moreover, more structured trees (i.e. manifold trees) that assume some manifold structure in the input data, can be built on top of the oblique tree codebase (by inheriting class structures from the Python/Cython code in scikit-learn). Some preliminary results on manifold oblique random forests are in [3], which help provide more empirical evidence that a base oblique forest class would be useful in scikit-learn.