OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

OpenBLAS nested parallelism #2052

Open · SanazGheibi opened this issue 5 years ago

SanazGheibi commented 5 years ago

Hi, we are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
        } else {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
        }
    }

Here is the issue:

What is going wrong, and how can we achieve the desired behavior? Is there any way we could know the number of threads inside the cblas_dgemm function?

Thank you very much for your time and help

martin-frbg commented 5 years ago

How big is your matrix? OpenBLAS will not use more than one thread if the product of the dimensions M, N and K is smaller than SMP_THRESHOLD_MIN*GEMM_MULTITHREAD_THRESHOLD (65535*4 = 262140, roughly 256K, by default).

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg. The matrices are rather large (M = N = K = 1024 or above). So I don't think that is the issue.

martin-frbg commented 5 years ago

I do not think there is a direct way to get the number of threads inside dgemm, you'd either need to look at your running program in a debugger, or instrument interface/gemm.c to print the args.nthreads it has decided to use. Which version of OpenBLAS, what hardware and operating system are you using ?

SanazGheibi commented 5 years ago

We are using OpenBLAS 0.3.5 on an AMD Opteron 6168, and the OS is Ubuntu 16.04 (Xenial). We have actually done the following: we modified the function cblas_dgemm.c inside the OpenBLAS directory to print out the number of threads at the very beginning of the function, using printf("%d\n", omp_get_num_threads()). We then compiled the whole library and linked it to our code. We expected that calling cblas_dgemm would cause the number of its internal threads to be printed, but that didn't happen.

martin-frbg commented 5 years ago

You can try the BLAS extension openblas_get_num_threads()

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg . I made this change, but still nothing is printed out.

martin-frbg commented 5 years ago

That is a bit suspicious. Are you sure that your program actually loads OpenBLAS at runtime, and not something else (like the single-threaded reference BLAS from Netlib) through the "alternatives" mechanism of Ubuntu?

SanazGheibi commented 5 years ago

We explicitly provide the link to libopenblas.so. However, the source code we modified is from an OpenBLAS folder where the only cblas_dgemm.c is inside a folder called lapack-netlib, so that is suspicious, as you say. However, if we remove the nested parallelism structure and leave only one call to cblas_dgemm, and set the number of OpenBLAS threads to different values using the environment variable OPENBLAS_NUM_THREADS, then the resulting runtime is sensitive to the number of threads.

brada4 commented 5 years ago

That's upstream (Netlib LAPACK) code that does not run in parallel. The cblas symbols are provided directly by OpenBLAS without an extra wrapper.

martin-frbg commented 5 years ago

Try adding your printout in interface/gemm.c - this file gets compiled twice from the Makefile, once with -DCBLAS and once without, to give both cblas_dgemm and dgemm (as well as sgemm, cgemm, zgemm and their cblas counterparts by (un)defining DOUBLE and COMPLEX as needed). The BLAS parts of lapack-netlib are not used in OpenBLAS; that directory is only included for LAPACK. (Sorry for not spotting this last night.)

brada4 commented 5 years ago

Seeing OpenMP in your code: you need to build OpenBLAS with OpenMP support, and that "support" is quite rudimentary - it turns into single-threaded OpenBLAS computation inside your parallel sections.

Complementing what martin said: you can use ltrace to get a list of which functions got called from which libraries, or use perf record ./program ; perf report to find the ones using the most CPU time.

A more pragmatic approach would be to build against the Netlib BLAS provided by Ubuntu, confirm it works at all, then use the alternatives mechanism to replace that library with OpenBLAS.

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg . I modified interface/gemm.c and put a print statement in each of the functions, but still nothing is printed out when I run my code. I suspect maybe I am doing the linking in the wrong way.

SanazGheibi commented 5 years ago

Thank you very much @brada4 . I have a question. Could you please explain a little more about

you need to build OpenBLAS with OpenMP support, that "support" is quite rudimentary and turns into single-threaded OpenBLAS computation inside your parallel sections.

Actually, I am compiling the code using the -fopenmp flag, and there are two threads in the outer level of the nested parallel section. Is that enough, or is there anything else I should do? I am asking because I read somewhere that OpenMP threads may conflict with OpenBLAS threads, and I suspect maybe that is somehow related to the support you are talking about.

martin-frbg commented 5 years ago

When you compile your code with "-lopenblas", this does not automatically ensure that exactly the same version of OpenBLAS will be loaded at runtime - there might be some other (and potentially older) version installed somewhere in the default library search paths on the system (like /lib, /usr/lib or /usr/local/lib). Running ldd on your program should show which libopenblas gets loaded by default; setting the LD_LIBRARY_PATH environment variable to your directory should make it look there.

brada4 commented 5 years ago

Namely following FAQ entries apply: https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts https://github.com/xianyi/OpenBLAS/wiki/faq#wronglibrary

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg . It worked, and now the number of threads is printed out. There is just one other issue: the first time I compiled and linked the library, there was a warning:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option

However, the number of threads was still sensitive to the environment variable OPENBLAS_NUM_THREADS, and by changing this variable, the number of threads that was printed out did vary.

After I recompiled the library with USE_OPENMP=1, there are no more warnings, but now, however I modify OPENBLAS_NUM_THREADS, the number of threads that is printed out is always 24 (the maximum number of threads in the system). Is there any way I can fix this problem? Thank you again.

SanazGheibi commented 5 years ago

Thank you very much @brada4

brada4 commented 5 years ago

Thread safety has probably improved a lot since that warning was introduced, and nothing hangs these days. The detected thread number is not as important as the total run-time reduction.

martin-frbg commented 5 years ago

Could be that it is always returning the value of OMP_NUM_THREADS now unfortunately. You can try removing the "#ifndef USE_OPENMP" (and matching #endif) around line 1952 of memory.c (this could be another bug related to my earlier mis-edit uncovered in #2002 - memory.c basically contains two versions of the thread setup code so you will see two definitions of blas_get_cpu_number there). Despite the recent thread safety improvements I do not think it is safe to mix OPENMP and non-OPENMP codes - the OpenMP management functions will not know anything about plain pthreads outside its control...

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg . I removed the

    #ifndef USE_OPENMP

(and matching #endif) around line 1952 of memory.c, but it still doesn't work.

And there is another issue: Inside interface/gemm.c, I have put two print statements:

If we remove the nested parallel structure and only call one instance of cblas_dgemm, both printed values are 24. However, if we use the nested parallel structure, the printf at the beginning of CNAME prints 24, but the one at the end of CNAME prints 1. What can be going wrong?

And here is our nested parallel structure (so that you don't have to go all the way up in the early posts):

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            // First call, with first set of arguments
            cblas_dgemm();
        } else {
            // Second call, with second set of arguments
            cblas_dgemm();
        }
    }

SanazGheibi commented 5 years ago

Thank you very much @brada4 , but for our case we need to know the number of threads in each block. The other thing is that we are not getting any runtime improvement compared to calling the two functions sequentially, which is really strange. So there may be something wrong with the thread distribution, and we need to figure that out.

brada4 commented 5 years ago

You can count CPU usage with the "time" command: if user+system time exceeds the total (wall-clock) time, then you are using multiple threads.

martin-frbg commented 5 years ago

args.nthreads in interface/gemm.c should only become 1 when the product of the matrix dimensions is small. Perhaps print args.m, args.n, args.k at that point as well, in case your code divides the workload unevenly between the two instances. (Print num_cpu_avail(3) too, just to be sure, though I do not think it could be 1.)

SanazGheibi commented 5 years ago

Thank you @martin-frbg . For our problem, args.m = args.n = args.k >= 512; that was verified once interface/gemm.c printed out these values.

However, the return value of num_cpu_avail(3) is printed out as 1. That is quite surprising, because there are 24 CPUs available in our system.

SanazGheibi commented 5 years ago

Thank you @brada4 .

SanazGheibi commented 5 years ago

Following up on my previous comment: if we only call one instance of cblas_dgemm and remove the nested parallelism, then the output of num_cpu_avail(3) is 24. Therefore, the idea that the system might be in use by other programs cannot hold in this case.

SanazGheibi commented 5 years ago

Another thing that is somewhat surprising to me: if I use the following setting for CPU affinity:

setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"

then regardless of whether or not we are using nested parallelism, the return value of openblas_get_num_threads() at the beginning of CNAME, the value of args.nthreads at the end of CNAME, and the return value of num_cpu_avail(3) will all be 1. What can be the reason for all of this? Thank you again.

brada4 commented 5 years ago

Are you certain you use the same OpenBLAS library for each test?

SanazGheibi commented 5 years ago

Yes, I am sure. There is only one OpenBLAS library, modified to print out the number of threads and the number of CPUs available, and I am using that.

brada4 commented 5 years ago

You can try omp_get_num_threads(); I think openblas_get_num_threads() just gets the number from there.

martin-frbg commented 5 years ago

Meh. Reading the implementation of num_cpu_avail() in common_thread.h, it is hardcoded to return "1" when in an OMP parallel region. (And has been like this since the days of GotoBLAS.) This could be a very old workaround for problems related to thread buffer memory allocation (and rogue overwriting). It will probably take some careful testing to see if the relatively recent introduction of MAX_PARALLEL_NUMBER (NUM_PARALLEL in Makefile.rule) from #1536 is sufficient on its own.

SanazGheibi commented 5 years ago

Thank you very much @brada4 and @martin-frbg . I will go through #1536 and see what I can figure out.

SanazGheibi commented 5 years ago

Thank you again @martin-frbg . I simply commented out

    #ifdef USE_OPENMP
      || omp_in_parallel()
    #endif

from common_thread.h, and it seems to be working. Now the number of threads inside the OpenBLAS function can be controlled from the calling function using omp_set_num_threads().

However, there is a problem remaining. If we use any of the following affinity settings:

    setenv OMP_PLACES cores
    setenv OMP_PROC_BIND close

or

    setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23"

then the number of threads and the number of CPUs available drop to 1. I really have no idea why that happens.

brada4 commented 5 years ago

Each calling thread is effectively constrained to one CPU; there is no easy way out of that. OpenBLAS also restricts the available CPUs based on the existing affinity mask, so as not to oversubscribe Docker containers, LXC, etc.

martin-frbg commented 5 years ago

I do not want to put it like that - some of the observed behaviour may simply be the result of even more hidden bugs.

(I am not sure what the documented/expected result of setting GOMP_CPU_AFFINITY to the entire range of available cores is - I would have expected OpenBLAS to handle it the same as if no affinity mask had been set, unless OpenMP itself creates an affinity mask of "all cores" for the first instance and "none" for the second. There is already an open issue - #1653 - about finding and documenting best practices for using OpenBLAS with OpenMP but I feel "more research is needed")

brada4 commented 5 years ago

If OMP_PLACES were sockets, it would be optimal for a multi-socket system. If our affinity mask is one CPU, we really don't know whether we have any right to break free of it.

SanazGheibi commented 5 years ago

Thank you very much @brada4 and @martin-frbg . Where in the OpenBLAS library are the affinity masks handled? Is there any way we could check and possibly modify that?

brada4 commented 5 years ago

See the patches linked in #1155 - they introduce parsing the affinity mask as a constraint on the available CPUs. Please try OMP_PLACES=sockets; that should address your problem completely, as your 2 OMP parallel threads will settle into one socket each.

martin-frbg commented 5 years ago

CPU enumeration happens in the function get_num_procs() of the file driver/others/memory.c, most recently updated in #2008. Beware that there are two occurrences of this in memory.c, one for the USE_TLS=1 branch (experimental code using thread-local storage) and one for USE_TLS=0; you will probably want to use/change the second instance.

brada4 commented 5 years ago

The problem here is that pthread OpenBLAS picks up side effects of the GOMP pthread setup, which is not the most orthodox configuration.

martin-frbg commented 5 years ago

IIRC GOMP on Linux is implemented on top of pthreads, and the data returned by sched_getaffinity should reflect whatever was defined through GOMP_CPU_AFFINITY. So I do not think there is anything unorthodox about this configuration or its interpretation by get_num_procs(). It could simply be that we have another "#ifdef USE_OPENMP, report a single core" elsewhere.

brada4 commented 5 years ago

Here is the reference showing that it is exactly those heavily customized configurations that are not working: https://github.com/xianyi/OpenBLAS/issues/2052#issuecomment-472708738

SanazGheibi commented 5 years ago

Thank you very much @martin-frbg and @brada4 . I will check and see what I can do.

jakub-homola commented 3 months ago

I am currently dealing with something similar.

In the main top-level readme, there is this line:

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.

So the reason why openblas_get_num_threads() returned the same as omp_get_num_threads() is because they are both based on the same environment variable.

To achieve the 2x8 nested parallelism, I think you will have to use openblas_set_num_threads() manually inside the parallel region, along with allowing nested OpenMP using e.g. export OMP_MAX_ACTIVE_LEVELS=2. The default-disabled nested OpenMP could have been the reason behind the original issue. I didn't test it, just a suggestion.