dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Experiment should raise warning if unoptimized numpy is not being used #550

Open thcrock opened 5 years ago

ivanhigueram commented 5 years ago

scikit-learn operations are built on NumPy and SciPy matrix operations. Both packages offer C-compiled functions to make those operations faster. Most of this compilation relies on C libraries like BLAS, ATLAS, and MKL. The first two are open-source libraries usually available in both Debian and Ubuntu. The last one is made by Intel and is optimized for parallel operations on Intel processors.

NumPy is usually compiled against these libraries, but sometimes that compilation can fail and NumPy will resort to fallback operations, which are not fast. This isn't really a triage issue; it's more about infrastructure, but fixing it can reduce triage modeling time considerably. One possible test is to inspect the NumPy configuration:

# Assuming NumPy is installed; import it under its usual alias
import numpy as np
np.__config__.show()

We want to check that openblas is pointing to the correct libraries and that the np.__config__.openblas_info dictionary is not empty.

One way to avoid this issue is to use Anaconda's Python with all of its environment management. Anaconda not only offers a C-compiled Python with openblas and atlas_x_x, but also a free version of MKL, which, on the Xeon processors that we use in AWS, should speed up processing quite a lot.

A pretty naive test: with openblas correctly installed, the eigendecomposition of a 2048 × 2048 square matrix takes 10.47 s; without openblas, 97.20 s.
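A timing like the one above could be reproduced with something along these lines. This is only a sketch: the original comment does not say how the matrix was built or which routine was called, so the random matrix, the use of `np.linalg.eig`, and the helper name `time_eigendecomposition` are all assumptions.

```python
import time

import numpy as np


def time_eigendecomposition(n=2048, seed=0):
    """Return the seconds taken to eigendecompose a random n x n matrix.

    Hypothetical helper: the matrix contents and the choice of
    np.linalg.eig are assumptions, not the method used in the thread.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.standard_normal((n, n))

    start = time.perf_counter()
    np.linalg.eig(matrix)  # delegates to LAPACK, which builds on BLAS
    return time.perf_counter() - start
```

Calling `time_eigendecomposition(2048)` matches the matrix size quoted above. The absolute number depends on the machine, the thread count, and whatever else is running, so it is better read as a rough indication than as a pass/fail test.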

jesteria commented 5 years ago

@ivanhigueram I'd be skeptical of the reliability of a speed test (though that result is certainly worth noting).

It looks like we can check numpy.show_config() (or numpy.__config__.show(), if that's better?), though I don't personally know what the test of that method's output would be.
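For reference, `numpy.show_config()` prints its report rather than returning it, so any test of its output would first have to capture the text. A minimal sketch of that step, assuming the report goes to standard output (the helper name `captured_numpy_config` is hypothetical):

```python
import contextlib
import io

import numpy as np


def captured_numpy_config():
    """Return the text that numpy.show_config() prints.

    Hypothetical helper: assumes the report is written to stdout.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()
```

The layout of this text differs between NumPy versions, and a section name such as `openblas_info` can appear even when the library is reported as unavailable, so matching on the printed text is fragile.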

ivanhigueram commented 5 years ago

One good check is whether NumPy was compiled with OpenBLAS. You can check that using np.__config__.openblas_info. If it returns an empty dictionary, then NumPy wasn't compiled with it and is falling back to slower routines for its calculations. The same can be done with mkl_info, which at the moment is only available in Conda's NumPy.
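A check along these lines might look like the sketch below. The function names and the warning text are hypothetical, and the attribute names `openblas_info` and `mkl_info` are taken from this thread. Other NumPy builds may not expose them at all, so a missing attribute is treated the same as an empty dictionary, which means this could warn spuriously on such builds.

```python
import warnings

import numpy as np

# Attribute names mentioned in this thread; other builds may not define them.
OPTIMIZED_INFO_ATTRS = ("openblas_info", "mkl_info")


def optimized_blas_sections(config=None):
    """Return the names of the non-empty info dictionaries on the config object."""
    config = np.__config__ if config is None else config
    found = []
    for name in OPTIMIZED_INFO_ATTRS:
        info = getattr(config, name, None)  # missing attribute counts as empty
        if info:
            found.append(name)
    return found


def warn_if_unoptimized(config=None):
    """Emit a RuntimeWarning and return True if no optimized library is reported."""
    if optimized_blas_sections(config):
        return False
    warnings.warn(
        "NumPy does not report OpenBLAS or MKL; matrix operations may be slow.",
        RuntimeWarning,
    )
    return True
```

Passing the config object as a parameter keeps the check testable without rebuilding NumPy: a stand-in object with or without those attributes exercises both branches.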

About the speed or efficiency, it is a long shot. I don't think this is going to improve our runtime by a lot, but it's worth using the right libraries for the operations we are doing. The eigenvalue decomposition is performed several thousand times every time you're running a tree. Memory and other characteristics are also open to consideration, but at least working with tuned tools is better than relying on fallback methods.

jesteria commented 5 years ago

Is it a long shot that these libraries improve performance? If they will significantly improve performance, even only in some cases, without significantly diminishing performance otherwise, I certainly think it makes sense to recommend them.

As for determining whether a performant library is installed in the environment, I sincerely doubt that a speed test will be reliable. But it sounds like we can simply check whether those config info attributes return empty or non-empty dictionaries, for example. :+1: