Closed: joerowell closed this issue 8 months ago
Thank you very much for the detailed writeup. I sort-of agree that this looks like a pathological case, but I find the apparent restriction to just 2 cores puzzling. (And I wonder if some middle ground exists, or if threads go crazy the moment you go from OPENBLAS_NUM_THREADS=1 to OPENBLAS_NUM_THREADS=2)
I didn't actually try OPENBLAS_NUM_THREADS=2: I would guess that this would be fine, provided that I only use 18 threads otherwise (so that the total "in use" core count sums to 20).
FWIW: I think the restriction to 2 cores is arbitrary. I test using 2 MPI processes that each use a single thread for doing MPI communication. If I use (say) n MPI processes, I get n cores being used actively. However, this still presents the same issue with thread contention, and it's generally faster for me to use thread parallelism rather than process parallelism.
I've met a similar issue, but only on an AMD Ryzen 7 5800H CPU. Here are the steps to reproduce:
1. Call openblas_set_num_threads(1), such that OpenBLAS only runs within a single thread.
2. Call sgemm concurrently from a pool of worker threads.
The result is:
Additionally, if I pin each thread to a CPU:
#include <pthread.h>  // pthread_setaffinity_np (Linux-specific)
#include <cstdint>
#include <iostream>
#include <thread>

static bool pin(std::thread &thread, const std::uint16_t cpu_id) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(cpu_id, &cpu_set);
    if (pthread_setaffinity_np(thread.native_handle(), sizeof(cpu_set_t), &cpu_set) != 0) {
        std::cerr << "Can not pin thread!" << std::endl;
        return false;
    }
    return true;
}
Then the program will not quit, no matter what the size of the thread pool is.
If I run the same program on an Intel CPU (both an i5 and an i7), the results are pretty normal, whether or not the threads are pinned to specific CPUs:
I think there exist some thread contention bugs between OpenMP and pthreads.
OpenMP on Linux relies on pthreads itself, but if OpenBLAS is not built with USE_OPENMP=1 there is no chance of either knowing about the thread usage of the other. I do not think it likely that there is an actual difference between Intel and AMD CPUs in this regard; maybe compiler and/or library versions were different in your test as well?
It seems openblas_set_num_threads() had no effect in the "AMD" case.
Run the hanging case, then:
cat /proc/<pid>/maps
Please check all occurrences of thread and omp; you may have two copies of the same framework loaded, which makes for undefined behaviour.
And show what is happening (attach captured output or extract significant-seeming pieces of it)
gdb
gdb> attach <pid>
gdb> thread apply all backtrace
gdb> detach
gdb> quit
Revisiting this, I see no possibility for improvement on OpenBLAS' side, as there is no way (to my knowledge) for the pthreads pool to obtain any information about the size (or even just the presence) of the MPI environment it is running in, and limit its own size accordingly. Using OPENBLAS_NUM_THREADS or the openblas_set_num_threads() function interface would appear to be the best one can do in this context, and I notice that guides like https://enccs.github.io/intermediate-mpi/ stress that mixing MPI with any other threading model adds overhead and potential for contention. Using an OpenMP-enabled OpenBLAS instead of the plain pthread one might be beneficial as one could then use OpenMP environment variables for binding the threads to appropriate cores (but again the number of threads to use in the presence of OpenMPI parallelism cannot be guessed by OpenBLAS/OpenMP)
I think this issue is broadly similar to https://github.com/xianyi/OpenBLAS/issues/2543, but I was asked to provide a bug report for this.
TL;DR: Running MPI programs with pthreads and OpenBLAS can cause CPU contention. This is fixed by setting OPENBLAS_NUM_THREADS=1.
I have a program that is somewhat pathological in its setup that uses OpenBLAS, so this may not be applicable to all use cases. Specifically, my program looks like this (it's based on G6K).
My program calls notify_all, which is implemented as pthread_cond_broadcast on my machine. This tanks performance unless I set OPENBLAS_NUM_THREADS=1. In particular, it appears to restrict my program to running exclusively on 2 cores (on a 20-core machine), regardless of how many threads I start. Moreover, the program spends around 60% of its time across all threads simply synchronising. I find this surprising: my view of how notify_all works is that it shouldn't wake threads that aren't waiting on that particular condition variable. I think the issue is (cf. https://github.com/xianyi/OpenBLAS/issues/2543) this:
In other words, I think the condition variables are substantially more expensive because the cores are over-subscribed, leading to extra context switches.
I'd like to point out that this is likely also an issue that's exacerbated by OpenMPI, which issues memory fences whenever certain requests are checked, which will make all of this far more expensive.
LMK if anything is unclear / if I can help with this in any way. I suspect the issue is unsolvable in general outside of setting the threads as described above: indeed, in my case I suspect that OpenBLAS starts its threads before my program does, so any sort of checking is likely to be difficult.