NiklasPfister / adaXT

adaXT: tree-based machine learning in Python
https://niklaspfister.github.io/adaXT/
BSD 3-Clause "New" or "Revised" License

Improve prediction #16

Closed svbrodersen closed 10 months ago

svbrodersen commented 12 months ago

Improve the prediction function so that it relies solely on C instead of looping in Python.

WilliamHeuser commented 11 months ago

Start by investigating whether it is feasible to keep our current predict function, which is written in pure Python. Predict is commonly run on a large number of points, so it needs to be fast. It would be a good idea to compare our speed to sklearn. One way to test this is to run the predict function on all the data points that the tree has just been trained on and compare the runtime to that of sklearn's predict.
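A minimal sketch of such a comparison, using only the standard library. The helper names `time_predict` and `compare` are made up for illustration and are not part of adaXT or sklearn; any two callables that take the data and return predictions can be passed in.

```python
import time


def time_predict(predict, X, repeats=5):
    """Return the best wall-clock time in ms over `repeats` calls of predict(X)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(X)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def compare(ours, reference, X, repeats=5):
    """Time two predict callables on the same data.

    Returns (ours_ms, reference_ms, ratio), where ratio > 1 means
    `ours` is slower than `reference`.
    """
    ours_ms = time_predict(ours, X, repeats)
    ref_ms = time_predict(reference, X, repeats)
    ratio = ours_ms / ref_ms if ref_ms > 0 else float("inf")
    return ours_ms, ref_ms, ratio
```

With two fitted trees this could be called as `compare(our_tree.predict, sk_tree.predict, X_train)`, where `our_tree`, `sk_tree`, and `X_train` are placeholders for the fitted models and the training data. Taking the best of several repeats reduces noise from other processes.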

WilliamHeuser commented 11 months ago

I tried running our prediction function on a dataset that was 10000 rows by 25 features and got the following results:

Our prediction runtime:
gini: 111 ms
entropy: 128 ms
squared: 59 ms

sklearn prediction runtime (unit not stated, presumably ms):
classification: 1.136
regression: 0.920

I also tried running our prediction on an 80000x25 dataset, which took about 900 ms. So, all in all, we are a lot slower than sklearn, which makes sense since our prediction function runs in pure Python. Personally, though, I think this is fast enough, as it is still feasible to run predictions on datasets as large as 80000 rows.

NiklasPfister commented 11 months ago

Thanks for running the test.

Unfortunately, I think this will be an issue: for random forests the prediction time is multiplied by the number of trees, and at that scale these runtimes become a concern. What would need to be changed to speed this up?
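As a rough back-of-the-envelope illustration of that scaling, assuming the trees are evaluated one after another and each costs about the same as the single tree timed above. The forest size of 100 trees is an assumption chosen for the example, not a number from this thread.

```python
def forest_predict_ms(per_tree_ms, n_trees):
    """Estimate sequential forest prediction time: per-tree time times tree count."""
    return per_tree_ms * n_trees


# 111 ms is the gini runtime reported above; 100 trees is an assumed forest size.
estimate_ms = forest_predict_ms(111, 100)
print(f"{estimate_ms / 1000:.1f} s")  # prints "11.1 s"
```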

svbrodersen commented 11 months ago

For the speed to increase dramatically, we probably have to change the Nodes from Python classes to cdef classes, since most of the runtime is spent in isinstance() calls (I think).

However, running the same test as @WilliamHeuser with the tree compiled as a pyx file instead gives a runtime ~20 times that of sklearn, compared to the previous ~100 times, for gini_index.
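To illustrate why the node checks dominate, here is a minimal sketch of a pure-Python traversal. The class layout and attribute names (`feature`, `threshold`, `left`, `right`, `value`) are assumptions made for this example and may not match adaXT's actual Node classes.

```python
class Node:
    """Hypothetical base class; adaXT's actual attributes may differ."""


class DecisionNode(Node):
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index of the feature to split on
        self.threshold = threshold  # go left if x[feature] <= threshold
        self.left = left
        self.right = right


class LeafNode(Node):
    def __init__(self, value):
        self.value = value  # prediction stored in the leaf


def predict_one(root, x):
    """Walk from the root to a leaf for one sample."""
    node = root
    # Every step down the tree costs an isinstance() call plus several
    # attribute lookups, all executed by the Python interpreter.
    while isinstance(node, DecisionNode):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict(root, X):
    """Predict every row of X with a plain Python loop."""
    return [predict_one(root, x) for x in X]


tree = DecisionNode(
    0, 0.5,
    LeafNode("a"),
    DecisionNode(1, 2.0, LeafNode("b"), LeafNode("c")),
)
print(predict(tree, [[0.1, 9.0], [0.9, 1.0], [0.9, 3.0]]))  # prints ['a', 'b', 'c']
```

For n rows and a tree of depth d, this performs on the order of n * d interpreter-level type checks, which is the overhead that typed cdef classes would be expected to remove.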

WilliamHeuser commented 11 months ago

After some more speedups to the prediction method, we have arrived at the following runtimes on a dataset of 100,000 rows by 25 features:

Classification, ours: 84 ms
Classification, sklearn: 13 ms
~7 times slower

Regression, ours: 59 ms
Regression, sklearn: 13 ms
~5 times slower

The next speedups can be gained by implementing Node, DecisionNode, and LeafNode as pure Cython objects instead of Python objects. I believe this would also require us to implement Tree and DepthTreeBuilder as pure Cython objects. I tried implementing this, but it quickly became complex and difficult. What are your thoughts, @NiklasPfister and @svbrodersen?