cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

Computational performance - optimize for speed #107

Open CBongiova opened 2 years ago

CBongiova commented 2 years ago

Hi,

I am using the ScikitLearn.jl library to train Random Forest classifiers. After training, I noticed that applying the trained models to new datapoints takes about 0.2 seconds. After some tests, it seems that this time is unrelated to the number of trees and features. Instead, it seems to be latency.

I had a look at the scikit-learn page on computational performance: https://scikit-learn.org/0.15/modules/computational_performance.html It mentions that the computational performance of scikit-learn relies heavily on NumPy/SciPy and linear algebra, so it makes sense to take care of these libraries. They propose checking that NumPy is built against an optimized BLAS/LAPACK library, as follows:

from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))

Any idea of how I can check for this in Julia? Else, do you have any suggestion to speed up the ScikitLearn.jl predictions?

cstjean commented 2 years ago

Any idea of how I can check for this in Julia?

I can't help directly, but ScikitLearn.jl is built on PyCall.jl, so you can check from there how to do that. Something like:

using PyCall
sys = pyimport("numpy.distutils.system_info")
sys.get_info("blas_opt")

Else, do you have any suggestion to speed up the ScikitLearn.jl predictions?

Are you making one call with a big n_sample X n_feature matrix to get your predictions?
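To illustrate what "one call" means here, the following is a minimal sketch that compares predicting row by row with predicting on the whole matrix at once. The data is synthetic, and the sizes (200 training samples, 45 features, 50 new datapoints) and variable names are assumptions chosen for illustration only. It assumes a working Python scikit-learn installation reachable through PyCall.

```julia
using ScikitLearn
@sk_import ensemble: RandomForestClassifier

# Synthetic stand-in data (assumption): 200 samples, 45 features, two classes.
X_train = rand(200, 45)
y_train = rand(0:1, 200)

model = RandomForestClassifier(n_estimators=100)
fit!(model, X_train, y_train)

X_new = rand(50, 45)   # 50 new datapoints, one per row

# Run both variants once first, so that the timings below do not
# include Julia compilation time.
preds_single = [predict(model, X_new[i:i, :])[1] for i in 1:size(X_new, 1)]
preds_batch = predict(model, X_new)

# One call per datapoint: any fixed per-call overhead is paid 50 times.
t_single = @elapsed [predict(model, X_new[i:i, :])[1] for i in 1:size(X_new, 1)]

# One call for the whole matrix: the overhead is paid once.
t_batch = @elapsed predict(model, X_new)

println("row by row: ", t_single, " s; batched: ", t_batch, " s")
```

Both variants should return the same labels, so batching only helps when several datapoints are available at the same time.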

Apart from that, it all depends on the Python code, so there's not much I can do there. DecisionTree.jl might provide

CBongiova commented 2 years ago

Hi @cstjean,

Thanks for your reply!

Are you making one call with a big n_sample X n_feature matrix to get your predictions?

No, I actually use the trained random forest classifiers (100 trees) to make atomic predictions online. That is, each time I only have one datapoint with about 45 features. Extracting the features is almost instantaneous, whereas making the predictions takes about 0.1 seconds.

I have actually found this discussion on Stack Overflow: https://stackoverflow.com/questions/50676717/why-sklearn-random-forest-takes-the-same-time-to-predict-one-sample-than-n-sampl The 0.1 seconds seems to be latency that is unavoidable with scikit-learn... maybe other libraries or ML approaches are more appropriate for real-time applications.

cstjean commented 2 years ago

DecisionTree.jl supports the ScikitLearn interface, so it shouldn't be too hard to give it a try!
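A rough sketch of what that could look like, assuming the `RandomForestClassifier`, `fit!`, and `predict` names from DecisionTree.jl's ScikitLearn-style API. The data is synthetic, and the sizes and labels are made up to mirror the setup described above (100 trees, 45 features, one datapoint at a time).

```julia
using DecisionTree

# Synthetic stand-in data (assumption): 200 samples, 45 features, two classes.
features = rand(200, 45)
labels = rand(["a", "b"], 200)

# Pure-Julia random forest with 100 trees; other hyperparameters keep their defaults.
model = RandomForestClassifier(n_trees=100)
fit!(model, features, labels)

# Predict a single datapoint, passed as a plain vector of 45 features.
x = rand(45)
label = predict(model, x)

# Time one more single-datapoint prediction, now that the call is compiled.
t = @elapsed predict(model, x)
println("predicted ", label, " in ", t, " s")
```

Since the whole prediction runs in Julia, there is no Python call involved. Whether that removes the latency in a given application is something to measure rather than assume.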