OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Could you elaborate on the combination of OpenBLAS with multi-threading? #2543

Open nh2 opened 4 years ago

nh2 commented 4 years ago

https://github.com/xianyi/OpenBLAS/wiki/Faq/4bded95e8dc8aadc70ce65267d1093ca7bdefc4c#multi-threaded says:

If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following ...

That is good to know, but it is very unspecific.

To aid people debugging problems, could you elaborate a bit on that? What does "will conflict" mean exactly here? How do things break? What are the fundamental technical reasons?


Also, many Linux distributions ship OpenBLAS with OpenMP enabled. Bindings from other programming languages that have their own built-in threading, not based on OpenMP (Haskell, Go, etc.), then use the distribution-provided packages, often without setting OPENBLAS_NUM_THREADS=1 or calling openblas_set_num_threads(1).

In my understanding, this violates what's stated in https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded and is thus unsafe. Can you, as a general guideline, confirm whether languages with built-in threading must address this?

Thanks!

martin-frbg commented 4 years ago

This is a bit unspecific because none of the original developers is still available to explain the reasoning behind certain concepts, and very little documentation was left. What has become clear in the last few years is that the original GotoBLAS was in no way thread-safe, probably owing to the fact that off-the-shelf multicore systems were quite uncommon 10+ years ago. Since then, many issues have been addressed - mostly by rather heavy-handed locking, but there may still be corner cases left to surprise us.

The other problem apart from thread safety is that OpenBLAS will by default try to spawn as many threads as there are available CPU cores - without any knowledge of, or regard for, how many the program that called it is already using. So this can easily lead to thread contention.

When OpenBLAS is built with OpenMP enabled, it uses fewer locks in the expectation that the OpenMP framework in the calling program will keep everything in check. This expectation is obviously flawed if that program is not using OpenMP at all. Conversely, when an OpenMP-using program meets an OpenBLAS that is built without OpenMP, it would be largely unaware of what the OpenBLAS threads are up to - as this is easier to detect than the opposite case, OpenBLAS will complain when that situation arises. I think we have at least one open ticket about supporting other alternatives to OpenMP, but no one familiar with these.
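To make the contention point concrete, here is a minimal sketch of one way a caller might cap the thread count it asks OpenBLAS for. The helper name and the "divide the cores evenly" policy are assumptions for illustration only, not something OpenBLAS provides or recommends. The only real OpenBLAS name used is openblas_set_num_threads, which is mentioned in the question above and appears here only in a comment.

```c
/*
 * Hypothetical helper (NOT part of OpenBLAS): given the number of CPU
 * cores and the number of threads the application already runs, return
 * how many threads each caller could reasonably request from OpenBLAS
 * so that the total stays near the core count.
 *
 * The even split is an assumed policy for illustration. The result
 * would typically be passed to openblas_set_num_threads(); a result
 * of 1 corresponds to the FAQ's "single thread" advice.
 */
static int blas_threads_per_caller(int total_cores, int app_threads)
{
    /* Treat nonsensical inputs as "one core" / "one caller". */
    if (total_cores < 1)
        total_cores = 1;
    if (app_threads < 1)
        app_threads = 1;

    /* Integer division rounds down, so never oversubscribe. */
    int share = total_cores / app_threads;

    /* Every caller needs at least one thread to make progress. */
    return share < 1 ? 1 : share;
}
```

With 16 cores and 4 application threads this returns 4. Once the application has at least as many threads as there are cores, it returns 1, which matches the FAQ guidance for programs that are already multi-threaded.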

nh2 commented 4 years ago

@martin-frbg Thanks for your reply! This is already very useful. I think more of this should be added to the wiki / docs -- I imagine there will be more contributors helping out with that situation if this is clearer, as there are some people that are really into threading.

The thread-contention part makes a lot of sense, that is a common problem.

On the topic of thread-safety, a couple of follow-up questions arise (but I understand that you may not be able to answer them fully).

One would be: "Is it unsafe to use OpenBLAS compiled/run with OpenMP enabled from a managed, threaded language like Haskell or Go, where some code paths lead through C++ wrapper code that itself uses OpenMP, while other code paths use OpenBLAS directly from the outer language?" It seems that the answer is: "It is probably unsafe, as we do not quite understand all parts of the code, and USE_LOCKING=1 is the safer default".

Another one is: Is my understanding correct that USE_LOCKING=1 is put only around memory allocation/deallocation (such as OpenBLAS's use of mmap)? As such, is it safe to assume that USE_LOCKING=1 may only negatively impact performance when OpenBLAS functionality is used that allocates memory? If yes, is there a good way to find out which OpenBLAS functions allocate memory / what the patterns are, to judge what the impact might be?

martin-frbg commented 4 years ago

I'd add more to the wiki if I were confident about it - unfortunately I have no local access to big servers, and plans to apply for time on one of the regional HPC clusters got delayed by current events. I still need to look into buying cloud time on AWS or similar to assess behaviour on systems with more than about 20 threads. (And I am not really familiar with either Haskell or Go - I rely on the respective communities to report any problems they encounter - which I believe works fairly well with Go and also NumPy/SciPy.)

The problem with locking in OpenBLAS is that threads need to coordinate and share information via memory, as each acts on a block of the original matrix and may need to access data from other parts. Most of the relevant code is concentrated in driver/others/memory.c and driver/level3 (note that memory.c is actually two implementations in one file with a big ifdef around them - some Google folks contributed a TLS version but unfortunately dropped out again before all problems were definitely solved).

h-vetinari commented 4 years ago

@martin-frbg: This is a bit unspecific as none of the original developers is still available to explain the reasoning behind certain concepts, and very little documentation was left. What has become clear in the last few years is that the original GotoBLAS was in no way thread-safe, probably owing to the fact that off-the-shelf multicore systems were quite uncommon 10+ years ago. Since then, many issues have been addressed - mostly by rather heavy-handed locking, but there may still be corner cases left to surprise us.

I'm not sure if you've considered this, but coming up with a coherent design and/or implementing it sounds like it could be an advanced GSoC project, with possible mentoring from yourself and other knowledgeable parties (@xianyi? @wjc404? @brada4? @stevengj?). See also: #2255, #2392.

martin-frbg commented 4 years ago

Thanks for the suggestion, but I do not see how the added complexity of mentoring a (necessarily inexperienced) GSoC participant would ease the pressure on the extremely small number of current developers at this stage.

brada4 commented 4 years ago

The reality is that there will be no long lazy summer for most. It would not benefit either side to push a named student through the project's missing grunt work. But if an outside summer project produces a measurably beneficial result in the form of a pull request, the result is more than welcome. Say, ML-focused tuning as part of another project, etc.

johnduffymsc commented 4 years ago

Hi. Thank you all for the comments on this issue so far. I’m still a little confused about the use of USE_THREAD, USE_LOCKING and USE_OPENMP when building OpenBLAS.

I’m reading USE_THREAD as “USE_PTHREAD”. So, ...

To build with pthread support I should use USE_THREAD=1, USE_OPENMP=0

To build with OpenMP support I should use USE_THREAD=0, USE_OPENMP=1

To build a serial version I should use USE_THREAD=0, USE_OPENMP=0, but with USE_LOCKING=1 for the reason in the previous comments.

Is my thinking correct?

Best wishes

John

martin-frbg commented 4 years ago

Almost - the OpenMP support sits on top of the threads support, so an OpenMP build needs USE_THREAD=1 USE_OPENMP=1.
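As a summary of the three build variants discussed above, with this correction applied, the following sketch formats the corresponding make invocation for each. The enum and function names are hypothetical and exist only for illustration. The flag names and values are the ones given in this thread. The sketch makes no claim about other options a real build might need.

```c
#include <stdio.h>

/* The three build variants discussed in this thread. */
enum blas_build { BUILD_PTHREAD, BUILD_OPENMP, BUILD_SERIAL_LOCKED };

/*
 * Hypothetical helper: write the make command line for one variant
 * into buf (at most len bytes, always NUL-terminated when len > 0).
 * Returns snprintf's result, or -1 for an unknown variant.
 */
static int format_make_command(enum blas_build kind, char *buf, size_t len)
{
    switch (kind) {
    case BUILD_PTHREAD:
        /* Plain pthread-based threading, no OpenMP. */
        return snprintf(buf, len, "make USE_THREAD=1 USE_OPENMP=0");
    case BUILD_OPENMP:
        /* OpenMP sits on top of the thread support, so both are 1. */
        return snprintf(buf, len, "make USE_THREAD=1 USE_OPENMP=1");
    case BUILD_SERIAL_LOCKED:
        /* Single-threaded library that may be called from several threads. */
        return snprintf(buf, len,
                        "make USE_THREAD=0 USE_OPENMP=0 USE_LOCKING=1");
    }
    return -1;
}
```

Passing BUILD_OPENMP with a sufficiently large buffer yields the string "make USE_THREAD=1 USE_OPENMP=1", which is the combination confirmed in the reply above.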

johnduffymsc commented 4 years ago

Great, got it. Thanks for the swift reply.