Closed: jeremiedbb closed this issue 9 months ago.
most likely you get too many threads running in parallel in the non-openmp case
omp_get_max_threads()
returns 4 (I have 2 physical cores + hyperthreading). Do you think OpenBLAS is using too many threads?
It only occurs when I use an OpenMP loop, so it means there's an interaction between OpenBLAS built with pthreads and OpenMP.
For what it's worth, the number of running threads shown by htop only increases by 4 when I run this.
Profiling with Linux perf shows that most of the time is spent in do_wait from libgomp.
Would be worth checking this:
> 2020-07-17: Conda-forge is building openblas with both pthreads and openmp on Linux. The main change is that openblas will use pthreads for threading by default on Linux instead of the previous openmp default. The openmp builds can be recovered by installing libopenblas=*=*openmp*.
Is it really libgomp? Note that we are using libomp (from LLVM) with the openblas openmp build because libgomp is not fork-safe. If you are loading libgomp.so.3 from conda-forge, it's a symlink to libomp, but if you are loading libgomp.so.3 from the system, it'll be the real libgomp, and therefore you'll have two instances of OpenMP loaded, which is not advisable.
My main concern is that the issue happens with OpenBLAS built with pthreads (not when built with OpenMP)!
When I execute my snippet above, libopenblas is loaded because numpy is imported, and libgomp is loaded because of the prange loop. It can be confirmed with:
from threadpoolctl import threadpool_info
threadpool_info()
[{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libopenblasp-r0.3.12.so',
'prefix': 'libopenblas',
'user_api': 'blas',
'internal_api': 'openblas',
'version': '0.3.12',
'num_threads': 4,
'threading_layer': 'pthreads'},
{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libgomp.so.1.0.0',
'prefix': 'libgomp',
'user_api': 'openmp',
'internal_api': 'openmp',
'version': None,
'num_threads': 4}]
If I replace prange by range, libgomp disappears from the list, and if I don't import numpy, libopenblas disappears.
It also confirms that I have only 1 OpenMP runtime loaded (threadpoolctl takes symlinks into account).
Also, for this specific environment (conda create -n tmp -c conda-forge python numpy cython ipython), it's libgomp and not libomp that gets installed by conda-forge:
libgomp conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19
Thanks for the info. That's to be expected, though. When running OpenMP, you create a few threads, and then each thread calls BLAS, which in turn creates more threads. In a multi-threaded environment, it's safest to just call openblas_set_num_threads(1) before calling BLAS.
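For illustration, here is a minimal sketch of that advice, not the reporter's code: the helper name, the blocked layout, and the sizes are made up, and a prototype for the OpenBLAS-specific extension openblas_set_num_threads is declared in case the cblas header in use does not expose it.

#include <stddef.h>
#include <omp.h>
#include "cblas.h"

// OpenBLAS-specific extension; declared here in case the cblas header doesn't expose it.
extern void openblas_set_num_threads(int num_threads);

// Each OpenMP thread computes one block row of C: C_b <- A_b @ B.T
void blocked_gemm(const double *A, const double *B, double *C,
                  int nblocks, int m, int n, int k)
{
    // One OpenBLAS worker per call, so the OpenMP threads below
    // don't each spawn their own OpenBLAS thread pool on top.
    openblas_set_num_threads(1);

    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    m, n, k, 1.0,
                    A + (size_t)b * m * k, k,
                    B, k, 0.0,
                    C + (size_t)b * m * n, n);
    }
}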
It's not an oversubscription issue here. They are not nested. I first call gemm and then call a function which executes a parallel loop (with no blas inside).
Also I ran ps -o nlwp <pid>
on the process running my python script and it returns 8, which is what I'd expect: 4 from openblas and 4 from openmp.
I see the same issue. 30x slowdown with libgomp and 2x slowdown with libomp.
OMP_PROC_BIND=TRUE
reduces the slowdown in libgomp.
OpenBLAS assumes each thread has a CPU at its hands with all the outermost caches. A 30x slowdown means multiple threads are stuck on the same CPU and actually spill to main memory instead of cache. You can set OPENBLAS_NUM_THREADS=1 when running your samples right now, before building the OMP version, to address the oversubscription issue. The pthread version has no idea that OpenMP is spinning up a full room of threads, and spins up its own in each thread. EDIT: hyperthreads do not add any value to HPC. You can disable them to make single-threaded work finish faster; parallel work will essentially have half as many double-speed cores, i.e. the same timing as with HT.
Sorry, I'm not sure I understand your answer. I agree that HT is useless in HPC most of the time, but it does not seem to be the only issue here, since the program is fast when OpenBLAS is built with pthreads and I run the loop in sequential mode. The issue only appears when I also run the loop in parallel with OpenMP. As a reminder, they are not nested but run one after the other.
I'm posting a pure C reproducer here; I hope it will make my concerns clearer.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "cblas.h"
double f2(double *X, int n){
    double v = 0.0;
    #pragma omp parallel for reduction(+:v)
    for(int i=0; i<n; i++){
        v += X[i];
    }
    return v;
}

int main(int argc, char **argv){
    int m = 10000,
        n = 10,
        k = 100;
    double *A = (double*) malloc(m * k * sizeof(double)),
           *B = (double*) malloc(n * k * sizeof(double)),
           *C = (double*) malloc(m * n * sizeof(double));

    for(int i=0; i<m*k; i++){
        A[i] = 0.1;
    }
    for(int i=0; i<n*k; i++){
        B[i] = 1.2;
    }

    double v = 0.0;
    for(int i=0; i<1000; i++){
        // BLAS call
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0, A, k, B, k, 0.0, C, n); // C <- A @ B.T
        // Followed by parallel loop
        v += f2(C, m * n);
    }

    free(A), free(B), free(C);
    return 0;
}
Here are the timings on a 40-core machine (20 physical cores + HT):

|                 | openblas pthreads | openblas openmp |
|-----------------|-------------------|-----------------|
| sequential loop | 1.22s             | 0.62s           |
| parallel loop   | 21s               | 0.51s           |
Focusing on the pthreads + parallel loop case:

- With OPENBLAS_NUM_THREADS=20 and OMP_NUM_THREADS=20, the result becomes 0.34s, which confirms that HT is detrimental, but I don't understand what the difference is with the openblas openmp + sequential loop case.
- With OPENBLAS_NUM_THREADS=21 and OMP_NUM_THREADS=21, the result is 0.5s, i.e. 50% slower than above, which I would not expect since it's only 1 more thread.
- With OPENBLAS_NUM_THREADS=22 and OMP_NUM_THREADS=22, the result is 5s! I don't think the only issue here is basic oversubscription.

@jeremiedbb, can you also try with OMP_PROC_BIND=TRUE?
> can you also try with OMP_PROC_BIND=TRUE?
@isuruf it reduces the time from 21s to 2.6s. It's better but still much slower than expected
That just forces each new unnecessary OpenBLAS pthread swarm onto a single core, slightly better at data locality than completely unbound but still bad. A 40-core machine might be 2 sockets with CoD each, i.e. effectively 4 NUMA nodes. Fitting the task into one CoD node (10 threads) will make it fall on a single cache (uncore) complex and perform optimally; you can manage 4 of those via OMP, a bit of a hack but it works most of the time. PS: you think a 50x slowdown is bad... 8-socket systems with subpar data locality do like 500x down...
> That just forces each new unnecessary OpenBLAS pthread swarm onto a single core, slightly better at data locality than completely unbound but still bad.
OpenBLAS pthread behaviour should not be affected by what openmp does, since the openblas calls are from the main thread and unrelated work happens in the openmp threads.
> behaviour should not be affected
The OMP placement policy actually sets the CPU affinity of the OMP threads, so any pthreads created afterwards cannot escape that. There is no hidden side-step API that nobody else uses.
@jeremiedbb maybe there is a way to introspect the CPU affinity of new pthreads started before and after calling the OMP f2
function in your minimal reproducer? That would help us confirm whether or not this is the cause of the problem.
https://man7.org/linux/man-pages/man3/pthread_getaffinity_np.3.html
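For reference, a minimal sketch of such an introspection helper, assuming glibc's non-portable pthread_getaffinity_np (hence _GNU_SOURCE); the helper name is made up, and it would need to be called both from inside the OpenMP loop and around the BLAS call to compare the masks.

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

// Print which CPU the calling thread is currently running on
// and which CPUs its affinity mask allows.
static void print_my_affinity(const char *label)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) != 0)
        return;

    printf("%s: on cpu %d, allowed:", label, sched_getcpu());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
}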
cat /proc/self/status | grep _allowed
Here's the output of cat /proc/self/status | grep _allowed:
Cpus_allowed: ff,ffffffff
Cpus_allowed_list: 0-39
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
Sorry but I've no idea how to interpret this :/
Put a one-minute sleep between OMP loops and check thread binding to CPU cores. It is actually documented in GOMP manual pages on any Linux system.
I introspected the affinities of both the OpenMP and OpenBLAS threadpools, and it turns out that no affinity constraint is set (openblas is built with NO_AFFINITY). Here's the output of sched_getaffinity from the snippet (I also checked the /proc/thread-self/status file and the results are the same):
* iteration 0
-- BLAS --
# thread 34120
- sched_getcpu: 2
- sched_getaffinity: 0, 1, 2, 3,
# thread 34119
- sched_getcpu: 3
- sched_getaffinity: 0, 1, 2, 3,
# thread 34121
- sched_getcpu: 0
- sched_getaffinity: 0, 1, 2, 3,
-- OMP --
# thread 34124
- sched_getcpu: 0
- sched_getaffinity: 0,1,2,3,
# thread 34122
- sched_getcpu: 3
- sched_getaffinity: 0,1,2,3,
# thread 34123
- sched_getcpu: 1
- sched_getaffinity: 0,1,2,3,
# thread 34118
- sched_getcpu: 0
- sched_getaffinity: 0,1,2,3,
So affinity does not seem to be the reason for the bad interaction between openblas-pthreads and openmp. However, I found that when the openmp loop ends, the threads keep waiting for work in an active way (OMP_WAIT_POLICY), which consumes resources and prevents openblas from starting its computations right away. By default, openmp makes waiting threads spin for a while.
Unfortunately, setting OMP_WAIT_POLICY=passive does not really improve performance on a machine with many cores, for some reason that I don't understand yet. The best solution I found so far is to set the number of threads for both to half the available hardware threads, besides building openblas with openmp of course.
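A rough sketch of that workaround in code, assuming the OpenBLAS extension openblas_set_num_threads is exported and taking half of the online logical CPUs for each pool (i.e. 20 each on the 40-thread machine above); setting OPENBLAS_NUM_THREADS and OMP_NUM_THREADS in the environment achieves the same thing.

#include <unistd.h>
#include <omp.h>

// OpenBLAS-specific extension.
extern void openblas_set_num_threads(int num_threads);

// Give each thread pool half of the logical CPUs so the spinning
// OpenMP workers and the OpenBLAS workers don't compete for the same cores.
static void split_thread_pools(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    int half = (ncpus > 1) ? (int)(ncpus / 2) : 1;

    openblas_set_num_threads(half);
    omp_set_num_threads(half);
}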
I guess this is a won't fix from the OpenBLAS side. OpenMP programs do not interact well with other libraries managing their own threadpool. Feel free to close the issue if you think there's nothing more to add about that.
Still, I wonder if there is the same kind of wait policy in openblas.
@isuruf I think this issue is a good reason to always try to use an openblas built with openmp for the scikit-learn builds on conda-forge (I noticed it was not always the case).
@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as 1 << n, where the default n is 26 (which, according to the comment in Makefile.rule, is equivalent to about 25ms at 3GHz).
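As a quick back-of-the-envelope check of that figure (the 3GHz clock is just the assumption from the Makefile comment, nothing measured here):

#include <stdio.h>

int main(void)
{
    const int n = 26;                          // default THREAD_TIMEOUT exponent
    const double cycles = (double)(1UL << n);  // about 6.7e7 cycles
    const double clock_hz = 3.0e9;             // assumed 3GHz clock
    printf("~%.0f ms\n", cycles / clock_hz * 1e3);  // prints ~22 ms, close to the quoted 25ms
    return 0;
}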
Closing after copying the relevant information to the FAQ in the wiki
Hi,
I have code that mixes BLAS calls (gemm) and OpenMP-based parallel loops (they are not nested). When OpenBLAS is built using OpenMP everything is fine, but when OpenBLAS is built with pthreads there's a huge slowdown. Below is a reproducible example (sorry, it's from Python/Cython).
On my laptop (2 physical cores), when I use a sequential loop in (*), it runs in 0.26s. When I use a parallel loop, it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. This conda env allows reproducing it:
conda create -n tmp -c conda-forge python numpy cython ipython
However, if I use OpenBLAS built with OpenMP, it runs in 0.26s with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. This conda env allows reproducing it:
conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9