Closed: joerowell closed this issue 8 months ago
Thank you very much for the detailed writeup. I sort-of agree that this looks like a pathological case, but I find the apparent restriction to just 2 cores puzzling. (And I wonder if some middle ground exists, or if threads go crazy the moment you go from OPENBLAS_NUM_THREADS=1 to OPENBLAS_NUM_THREADS=2)
I didn't actually try OPENBLAS_NUM_THREADS=2: I would guess that this would be fine, provided that I only use 18 threads otherwise (so that the total "in use" core count sums to 20).
FWIW: I think the restriction to 2 cores is arbitrary. I test using 2 MPI processes that each use a single thread for doing MPI communication. If I use (say) n MPI processes, I get n cores being used actively. However, this still presents the same issue with thread contention, and it's generally faster for me to use thread parallelism rather than process parallelism.
I've met a similar issue, but only on an AMD Ryzen 7 5800H CPU. Here are the steps to reproduce:
1. Call openblas_set_num_threads(1), such that OpenBLAS only runs within a single thread.
2. Call sgemm concurrently from a pool of worker threads.
The result is:
Additionally, if I pin each thread to a CPU:
#include <pthread.h>  // pthread_setaffinity_np (Linux-specific)
#include <cstdint>
#include <iostream>
#include <thread>

static bool pin(std::thread &thread, const std::uint16_t cpu_id) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(cpu_id, &cpu_set);
    if (pthread_setaffinity_np(thread.native_handle(), sizeof(cpu_set_t), &cpu_set) != 0) {
        std::cerr << "Can not pin thread!" << std::endl;
        return false;
    }
    return true;
}
Then the program will not quit, no matter what the size of the thread pool is.
If I run the same program on an Intel CPU (both an i5 and an i7), the results are pretty normal, whether or not the threads are pinned to specific CPUs:
I think there exist some thread contention bugs between OpenMP and pthreads.
OpenMP on Linux relies on pthreads itself, but if OpenBLAS is not built with USE_OPENMP=1 there is no chance of either knowing about the thread usage of the other. I do not think it likely that there is an actual difference between Intel and AMD CPUs in this regard; maybe compiler and/or library versions were different in your test as well?
It seems openblas_set_num_threads() had no effect in the "AMD" case.
Run the hanging case, then:
cat /proc/<pid>/maps
Please check all occurrences of thread and omp; you may have two copies of the same framework loaded, which makes for undefined behaviour.
And show what is happening (attach captured output or extract significant-seeming pieces of it)
gdb
gdb> attach <pid>
gdb> thread apply all backtrace
gdb> detach
gdb> quit
Revisiting this, I see no possibility for improvement on OpenBLAS' side, as there is no way (to my knowledge) for the pthreads pool to obtain any information about the size (or even just the presence) of the MPI environment it is running in, and limit its own size accordingly. Using OPENBLAS_NUM_THREADS or the openblas_set_num_threads() function interface would appear to be the best one can do in this context, and I notice that guides like https://enccs.github.io/intermediate-mpi/ stress that mixing MPI with any other threading model adds overhead and potential for contention. Using an OpenMP-enabled OpenBLAS instead of the plain pthread one might be beneficial as one could then use OpenMP environment variables for binding the threads to appropriate cores (but again the number of threads to use in the presence of OpenMPI parallelism cannot be guessed by OpenBLAS/OpenMP)
I think this issue is broadly similar to https://github.com/xianyi/OpenBLAS/issues/2543, but I was asked to provide a bug report for this.
TL;DR: Running MPI programs with pthreads and OpenBLAS can cause CPU contention. This is fixed by setting OPENBLAS_NUM_THREADS=1.
I have a program that is somewhat pathological in its setup that uses OpenBLAS, so this may not be applicable to all use cases. Specifically, my program looks like this (it's based on G6K).
My program calls notify_all, which is implemented as pthread_cond_broadcast on my machine. This tanks performance unless I set OPENBLAS_NUM_THREADS=1. In particular, it appears to restrict my program to running exclusively on 2 cores (on a 20-core machine), regardless of how many threads I start. Moreover, the program spends around 60% of its time across all threads simply synchronising. I find this surprising: my view of how notify_all works is that it shouldn't wake threads that aren't waiting on that particular condition variable. I think the issue is (cf. https://github.com/xianyi/OpenBLAS/issues/2543) this:
In other words, I think the condition variables are substantially more expensive because the cores are over-subscribed, leading to extra context switches.
I'd like to point out that this is likely also an issue that's exacerbated by OpenMPI, which issues memory fences whenever certain requests are checked, which will make all of this far more expensive.
LMK if anything is unclear / if I can help with this in any way. I suspect the issue is unsolvable in general outside of setting the threads as described above: indeed, in my case I suspect that OpenBLAS starts its threads before my program does, so any sort of checking is likely to be difficult.