Closed: jeremiedbb closed this issue 9 months ago.
most likely you get too many threads running in parallel in the non-openmp case
omp_get_max_threads()
returns 4 (I have 2 physical cores + hyperthreading). Do you think OpenBLAS is using too many threads?
It only occurs when I use an OpenMP loop, so it means there's an interaction between OpenBLAS built with pthreads and OpenMP.
For what it's worth, the number of running threads shown by htop only increases by 4 when I run this.
Profiling with Linux perf shows that most of the time is spent in do_wait from libgomp.
Would be worth checking this:
> 2020-07-17: Conda-forge is building openblas with both pthreads and openmp on Linux. The main change is that openblas will use pthreads for threading by default on Linux instead of the previous openmp default. The openmp builds can be recovered by installing libopenblas=*=*openmp*.
Is it really libgomp? Note that we are using libomp (from LLVM) with the openblas openmp build because libgomp is not fork-safe. If you are loading libgomp.so.3 from conda-forge, it's a symlink to libomp, but if you are loading libgomp.so.3 from the system, it'll be the real libgomp, and therefore you'll have two instances of OpenMP loaded, which is not advisable.
My main concern is that the issue happens with OpenBLAS built with pthreads (not when built with OpenMP)!
When I execute my snippet above, libopenblas is loaded because numpy is imported, and libgomp is loaded because of the prange loop. It can be confirmed with:
from threadpoolctl import threadpool_info
threadpool_info()
[{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libopenblasp-r0.3.12.so',
'prefix': 'libopenblas',
'user_api': 'blas',
'internal_api': 'openblas',
'version': '0.3.12',
'num_threads': 4,
'threading_layer': 'pthreads'},
{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libgomp.so.1.0.0',
'prefix': 'libgomp',
'user_api': 'openmp',
'internal_api': 'openmp',
'version': None,
'num_threads': 4}]
If I replace prange by range, libgomp disappears from the list, and if I don't import numpy, libopenblas disappears.
It also confirms that I have only 1 OpenMP runtime loaded (threadpoolctl takes symlinks into account).
Also, for this specific environment (conda create -n tmp -c conda-forge python numpy cython ipython), it's libgomp and not libomp that gets installed by conda-forge:
libgomp conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19
Thanks for the info. That's to be expected, though. When running OpenMP, you create a few threads, and then each thread calls BLAS, which in turn creates more threads. In a multi-threaded environment, it's safest to just call openblas_set_num_threads(1) before calling BLAS.
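For illustration, here is a minimal sketch of that advice, not the reporter's code: the helper name, the blocked layout, and the sizes are made up, and a prototype for the OpenBLAS-specific extension openblas_set_num_threads is declared in case the cblas header in use does not expose it.

#include <stddef.h>
#include <omp.h>
#include "cblas.h"

// OpenBLAS-specific extension; declared here in case the cblas header doesn't expose it.
extern void openblas_set_num_threads(int num_threads);

// Each OpenMP thread computes one block row of C: C_b <- A_b @ B.T
void blocked_gemm(const double *A, const double *B, double *C,
                  int nblocks, int m, int n, int k)
{
    // One OpenBLAS worker per call, so the OpenMP threads below
    // don't each spawn their own OpenBLAS thread pool on top.
    openblas_set_num_threads(1);

    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    m, n, k, 1.0,
                    A + (size_t)b * m * k, k,
                    B, k, 0.0,
                    C + (size_t)b * m * n, n);
    }
}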
It's not an oversubscription issue here. They are not nested. I first call gemm and then call a function which executes a parallel loop (with no blas inside).
Also I ran ps -o nlwp <pid>
on the process running my python script and it returns 8, which is what I'd expect: 4 from openblas and 4 from openmp.
I see the same issue. 30x slowdown with libgomp and 2x slowdown with libomp.
OMP_PROC_BIND=TRUE
reduces the slowdown in libgomp.
OpenBLAS assumes each thread has a CPU at its hands with all the outermost caches. A 30x slowdown means multiple threads are stuck on the same CPU and actually spill to main memory instead of cache. You can set OPENBLAS_NUM_THREADS=1 when running your samples right now, before building the OMP version, to address the oversubscription issue. The pthread version has no idea that OpenMP is spinning up a full room of threads, and spins up its own in each thread. EDIT: hyperthreads do not add any value to HPC. You can disable them to make single-threaded work finish faster; parallel work will essentially have half as many double-speed cores, i.e. the same timing as with HT.
Sorry, I'm not sure I understand your answer. I agree that HT is useless in HPC most of the time, but it does not seem to be the only issue here, since the program is fast when OpenBLAS is built with pthreads and I run the loop in sequential mode. The issue only appears when I also run the loop in parallel with OpenMP. As a reminder, they are not nested but run one after the other.
I'm posting a pure C reproducer here; I hope it will make my concerns clearer.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "cblas.h"
double f2(double *X, int n){
    double v = 0.0;
    #pragma omp parallel for reduction(+:v)
    for(int i=0; i<n; i++){
        v += X[i];
    }
    return v;
}

int main(int argc, char **argv){
    int m = 10000,
        n = 10,
        k = 100;
    double *A = (double*) malloc(m * k * sizeof(double)),
           *B = (double*) malloc(n * k * sizeof(double)),
           *C = (double*) malloc(m * n * sizeof(double));

    for(int i=0; i<m*k; i++){
        A[i] = 0.1;
    }
    for(int i=0; i<n*k; i++){
        B[i] = 1.2;
    }

    double v = 0.0;
    for(int i=0; i<1000; i++){
        // BLAS call
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0, A, k, B, k, 0.0, C, n); // C <- A @ B.T
        // Followed by parallel loop
        v += f2(C, m * n);
    }

    free(A), free(B), free(C);
    return 0;
}
Here are the timings on a 40-core machine (20 physical cores + HT):

|                 | openblas pthreads | openblas openmp |
|-----------------|-------------------|-----------------|
| sequential loop | 1.22s             | 0.62s           |
| parallel loop   | 21s               | 0.51s           |
Focusing on the pthreads + parallel loop case:

- With OPENBLAS_NUM_THREADS=20 and OMP_NUM_THREADS=20, the result becomes 0.34s, which confirms that HT is detrimental, but I don't understand what the difference is with the openblas openmp + sequential loop case.
- With OPENBLAS_NUM_THREADS=21 and OMP_NUM_THREADS=21, the result is 0.5s, i.e. 50% slower than above, which I would not expect since it's only 1 more thread.
- With OPENBLAS_NUM_THREADS=22 and OMP_NUM_THREADS=22, the result is 5s! I don't think the only issue here is basic oversubscription.

@jeremiedbb, can you also try with OMP_PROC_BIND=TRUE?
> can you also try with OMP_PROC_BIND=TRUE?
@isuruf it reduces the time from 21s to 2.6s. It's better but still much slower than expected
That just forces each new unnecessary OpenBLAS pthread swarm onto a single core, slightly better at data locality than completely unbound but still bad. A 40-core machine might be 2 sockets with CoD each, i.e. effectively 4 NUMA nodes. Fitting the task into one CoD node (10 threads) will make it fall on a single cache (uncore) complex and perform optimally; you can manage 4 of those via OMP, a bit of a hack but it works most of the time. PS: you think a 50x slowdown is bad... 8-socket systems with subpar data locality do like 500x down...
> That just forces each new unnecessary OpenBLAS pthread swarm onto a single core, slightly better at data locality than completely unbound but still bad.
OpenBLAS pthread behaviour should not be affected by what openmp does, since the openblas calls are from the main thread and unrelated work happens in the openmp threads.
> behaviour should not be affected
The OMP placement policy actually sets the CPU affinity of the OMP threads, so any pthreads created afterwards cannot escape that. There is no hidden side-step API that nobody else uses.
@jeremiedbb maybe there is a way to introspect the CPU affinity of new pthreads started before and after calling the OMP f2
function in your minimal reproducer? That would help us confirm whether or not this is the cause of the problem.
https://man7.org/linux/man-pages/man3/pthread_getaffinity_np.3.html
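For reference, a minimal sketch of such an introspection helper, assuming glibc's non-portable pthread_getaffinity_np (hence _GNU_SOURCE); the helper name is made up, and it would need to be called both from inside the OpenMP loop and around the BLAS call to compare the masks.

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

// Print which CPU the calling thread is currently running on
// and which CPUs its affinity mask allows.
static void print_my_affinity(const char *label)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) != 0)
        return;

    printf("%s: on cpu %d, allowed:", label, sched_getcpu());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
}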
cat /proc/self/status | grep _allowed
Here's the output of cat /proc/self/status | grep _allowed:
Cpus_allowed: ff,ffffffff
Cpus_allowed_list: 0-39
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
Sorry but I've no idea how to interpret this :/
Put a one-minute sleep between OMP loops and check thread binding to CPU cores. It is actually documented in GOMP manual pages on any Linux system.
I introspected the affinities of both the OpenMP and OpenBLAS threadpools, and it turns out that no affinity constraint is set (openblas is built with NO_AFFINITY). Here's the output of sched_getaffinity from the snippet (I also checked the /proc/thread-self/status file and the results are the same):
* iteration 0
-- BLAS --
# thread 34120
- sched_getcpu: 2
- sched_getaffinity: 0, 1, 2, 3,
# thread 34119
- sched_getcpu: 3
- sched_getaffinity: 0, 1, 2, 3,
# thread 34121
- sched_getcpu: 0
- sched_getaffinity: 0, 1, 2, 3,
-- OMP --
# thread 34124
- sched_getcpu: 0
- sched_getaffinity: 0,1,2,3,
# thread 34122
- sched_getcpu: 3
- sched_getaffinity: 0,1,2,3,
# thread 34123
- sched_getcpu: 1
- sched_getaffinity: 0,1,2,3,
# thread 34118
- sched_getcpu: 0
- sched_getaffinity: 0,1,2,3,
So affinity does not seem to be the reason for the bad interaction between openblas-pthreads and openmp. However, I found that when the openmp loop ends, the threads keep waiting for work in an active way (OMP_WAIT_POLICY), which consumes resources and prevents openblas from starting its computations right away. By default, openmp makes waiting threads spin for a while.
Unfortunately, setting OMP_WAIT_POLICY=passive does not really improve performance on a machine with many cores, for some reason that I don't understand yet. The best solution I found so far is to set the number of threads for both to half the available hardware threads, besides building openblas with openmp of course.
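A rough sketch of that workaround in code, assuming the OpenBLAS extension openblas_set_num_threads is exported and taking half of the online logical CPUs for each pool (i.e. 20 each on the 40-thread machine above); setting OPENBLAS_NUM_THREADS and OMP_NUM_THREADS in the environment achieves the same thing.

#include <unistd.h>
#include <omp.h>

// OpenBLAS-specific extension.
extern void openblas_set_num_threads(int num_threads);

// Give each thread pool half of the logical CPUs so the spinning
// OpenMP workers and the OpenBLAS workers don't compete for the same cores.
static void split_thread_pools(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    int half = (ncpus > 1) ? (int)(ncpus / 2) : 1;

    openblas_set_num_threads(half);
    omp_set_num_threads(half);
}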
I guess this is a won't fix from the OpenBLAS side. OpenMP programs do not interact well with other libraries managing their own threadpool. Feel free to close the issue if you think there's nothing more to add about that.
Still, I wonder if there is the same kind of wait policy in openblas.
@isuruf I think this issue is a good reason to always try to use an openblas built with openmp for the scikit-learn builds on conda-forge (I noticed it was not always the case).
@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as 1 << n, where the default n is 26 (which, according to the comment in Makefile.rule, is equivalent to about 25ms at 3GHz).
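As a quick back-of-the-envelope check of that figure (the 3GHz clock is just the assumption from the Makefile comment, nothing measured here):

#include <stdio.h>

int main(void)
{
    const int n = 26;                          // default THREAD_TIMEOUT exponent
    const double cycles = (double)(1UL << n);  // about 6.7e7 cycles
    const double clock_hz = 3.0e9;             // assumed 3GHz clock
    printf("~%.0f ms\n", cycles / clock_hz * 1e3);  // prints ~22 ms, close to the quoted 25ms
    return 0;
}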
Closing after copying the relevant information to the FAQ in the wiki
Hi,
I have code that mixes BLAS calls (gemm) and OpenMP-based parallel loops (they are not nested). When OpenBLAS is built using OpenMP everything is fine, but when OpenBLAS is built with pthreads there's a huge slowdown. Below is a reproducible example (sorry, it's from Python/Cython).
On my laptop (2 physical cores), when I use a sequential loop in (*), it runs in 0.26s. When I use a parallel loop, it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. This conda env allows reproducing it:
conda create -n tmp -c conda-forge python numpy cython ipython
However, if I use OpenBLAS built with OpenMP, it runs in 0.26s with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. This conda env allows reproducing it:
conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9