flame / blis

BLAS-like Library Instantiation Software Framework

Weird multithread behaviour #426

Closed gmargari closed 4 years ago

gmargari commented 4 years ago

Hi devs, and thanks in advance for your time and for all the effort you put into this project,

I'm currently benchmarking BLIS with multiple threads, and I'm seeing some (at least to me) unexpected behaviour. I suspect I have misconfigured something, but can't seem to find what.

I'm building BLIS 0.7.0 with:

./configure --enable-threading=openmp auto
make -j `nproc`

on CentOS 7.6 with GCC 4.9.2, running on a 6-core/12-thread i7-6800K. It selects the Haswell configuration. (Unfortunately this is done on our internal private network, so I can't transfer logs, code, etc.)

I benchmark a single function from each BLAS level, namely DDOT, DGEMV, and DGEMM, for various large random matrices and vectors (e.g. dimension = 4096). I use the BLAS compatibility layer and link against the dynamic library.

I set the number of threads using BLIS_NUM_THREADS. At least for DGEMV I also tried bli_thread_set_max_threads(), as well as the expert BLIS interface, setting the number of threads per call via rntm_t. The number of threads is correctly returned from bli_thread_get_num_threads().
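
Roughly, the per-call approach I mean looks like the sketch below, using dgemm as an example (written out from the BLIS multithreading documentation rather than copied from my actual benchmark, so the exact signatures of BLIS_RNTM_INITIALIZER, bli_rntm_set_num_threads() and bli_dgemm_ex() should be double-checked against blis.h):

#include "blis.h"

// Multiply two n x n column-major matrices with an explicit thread count.
// Sketch only: the rntm_t passed to the expert interface applies to this
// call alone, as opposed to the global BLIS_NUM_THREADS /
// bli_thread_set_num_threads() settings.
void dgemm_with_nt( dim_t n, double* a, double* b, double* c, dim_t nt )
{
    double one = 1.0;

    // Encode the thread count in a runtime object.
    rntm_t rntm = BLIS_RNTM_INITIALIZER;
    bli_rntm_set_num_threads( nt, &rntm );

    // Typed expert API: trailing cntx/rntm arguments; NULL cntx lets BLIS choose.
    bli_dgemm_ex( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                  n, n, n,
                  &one, a, 1, n,   // row stride 1, column stride n
                        b, 1, n,
                  &one, c, 1, n,
                  NULL, &rntm );
}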

The strange behaviors I have seen are:

  • BLIS "ignoring" the num threads for DDOT, DGEMV (i.e. time is not reduced when number of threads is increased from 1 to 2 and 4. Also I see in htop that only 1 thread is active)
  • time considerably increasing (e.g. an order of magnitude) for DGEMM when I increase the number of threads from 2 to 4.

In the last case, I used valgrind and saw that the majority of time (> 90%) was spent in bli_thrcomm_barrier_atomic().

I also recompiled with --enable-threading=pthreads which didn't change anything. I tried the AMD multithreaded prebuilt library and saw similar behavior, so it's probably not related to my build.

#288 seems related, but in my case bli_thread_get_num_threads() returns the correct number of threads.

Does this ring any bell to you? Could you suggest some directions for debugging this?

devinamatthews commented 4 years ago
  • BLIS "ignoring" the num threads for DDOT, DGEMV (i.e. time is not reduced when number of threads is increased from 1 to 2 and 4. Also I see in htop that only 1 thread is active)

This is normal; we have not put in multithreading for Level 1 and 2 BLAS operations yet.

  • time considerably increasing (e.g. an order of magnitude) for DGEMM when I increase the number of threads from 2 to 4.

This sounds to me like the threads might all be occupying the same physical core. Do you have any OpenMP or GOMP thread affinity environment variables set?

devinamatthews commented 4 years ago

I also recompiled with --enable-threading=pthreads which didn't change anything.

I missed this. If this gives the same behavior then it can't be an OpenMP/GOMP thing, but could still be thread affinity. Can you send the output of taskset -a under the same conditions as the BLIS run?

gmargari commented 4 years ago

Yeah, I forgot to mention that I also tried setting OMP_PLACES=cores and OMP_PROC_BIND=close in the case of GOMP (but these shouldn't matter with pthreads).

gmargari commented 4 years ago

I'm not familiar with taskset; what should the CPU mask argument be? (I'm trying taskset -a ./myprog <myargs> and getting the error failed to parse CPU mask: ./myprog.)

devinamatthews commented 4 years ago

No arguments other than -a. This should return the current CPU affinity mask.

devinamatthews commented 4 years ago

Never mind, that doesn't do what I think it does. I'll look up how to check it from inside the test program.

devinamatthews commented 4 years ago

Something like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Query the affinity mask of the calling process (pid 0 = the caller). */
    cpu_set_t set;
    sched_getaffinity(0, sizeof(set), &set);

    /* Print every logical CPU this process is allowed to run on. */
    for (int i = 0; i < CPU_SETSIZE; i++)
    {
        if (CPU_ISSET(i, &set)) printf("%d\n", i);
    }

    return 0;
}

gmargari commented 4 years ago

It prints all logical cores:

0
1
2
3
4
5
6
7
8
9
10
11
12
vkirangoud commented 4 years ago

Why don't you try numactl -C 0-3 to pin the threads to cores 0-3?

gmargari commented 4 years ago

Hi @vkirangoud , thanks for the suggestion. Unfortunately it seems numactl is not available on the machine, and I cannot install software on it.

I'm looking at this again today, and the "weird multithreaded behavior" now seems to be gone, but I really need to find what caused it so we don't bump into it again (we will be using BLIS in a commercial product). I thought it might have been a wrong/misconfigured OMP env variable, since I was playing with them, but env does not show anything right now and I cannot tell which variables were set when it happened. Would that make sense, i.e. could some OMP variable among the ones mentioned in the BLIS multithreading doc cause this weird behavior (e.g. the "overuse" of bli_thrcomm_barrier_atomic() that valgrind reported)?

vkirangoud commented 4 years ago

The BLIS library doesn't set any environment variables, so that's not the problem. If you are running dgemm with small problem sizes, the barrier might take more time.

gmargari commented 4 years ago

I'm sorry, I probably didn't make myself clear: I was the one setting the various variables that the BLIS documentation mentions. So I thought maybe I had set some value, or some combination of values, that resulted in this behavior.

The matrix sizes I was testing were large, on the order of 4096 x 4096.

devinamatthews commented 4 years ago

@gmargari I did have this happen on one HPC system when I set environment variables for OMP thread affinity. As I hinted above, the symptom was that all threads ended up running on the same physical core which led to massive oversubscription and constant task switching. I never could figure out exactly what caused the problem (nor could the admins) so I just unset the variables and moved on with my life. I think it's unlikely that it's a BLIS bug because the way that OpenMP/pthreads is used is actually very simple, but I suppose it could be. Please let us know if you ever do find out what the problem is!

jeffhammond commented 4 years ago

It might be worthwhile to add a feature request for Pthread affinity support in BLIS...
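
For concreteness, "pthread affinity support" would mean BLIS pinning its own worker threads, roughly as in this generic sketch using pthread_setaffinity_np() (illustration of the mechanism only, not the actual or proposed BLIS code; compile with -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

// Each worker pins itself to the logical core passed in as its argument.
static void* worker( void* arg )
{
    long core = (long) arg;

    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( core, &set );
    pthread_setaffinity_np( pthread_self(), sizeof(set), &set );

    printf( "worker pinned to core %ld\n", core );
    return NULL;
}

int main( void )
{
    // Spawn four workers, one per core 0-3, then wait for them to finish.
    pthread_t t[4];
    for ( long i = 0; i < 4; i++ ) pthread_create( &t[i], NULL, worker, (void*) i );
    for ( long i = 0; i < 4; i++ ) pthread_join( t[i], NULL );
    return 0;
}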

fgvanzee commented 4 years ago

@jeffhammond This has been on my long-term radar for some time, but always at low priority, because I've encountered very few (if any) instances where the environment has pthreads but doesn't have (or doesn't want to use) OpenMP support.

It's also worth reiterating that @gmargari initially observed the same strange behavior with pthreads, so there is some irony in your off-topic suggestion of opening a separate feature request. :laughing:

gmargari commented 4 years ago

@devinamatthews it was good to hear that at least in your case it was a temporary misconfiguration issue and didn't happen again, so I was also about to "move on with my life"© and forget about it. But then as I was trying some last things it happened again. This time I was able to somewhat trace it.

So luckily it does not seem to be a BLIS issue, but an MKL issue with the OMP env vars (I was linking against both MKL and BLIS, and I was trying the OMP variables mentioned in the BLIS multithreading article). Even though it's probably not a BLIS issue, I thought of documenting it here, since other people switching from MKL to BLIS may bump into it (e.g. if, like us, they use non-BLAS functions from MKL, such as PARDISO or Sparse BLAS, they'll need to link against both BLIS and MKL). I would also like to get some feedback.

My minimal test program is this (I'm writing everything below by hand since I cannot copy from internal network):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern "C" void dgemm_(const char *transA, const char *transB, const int *m, const int *n, const int *k, 
                       const double *alpha, const double *A, const int *lda, const double *B, const int *ldb,
                       const double *beta, double *C, const int *ldc);

int main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Syntax: %s size\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    const int n = atoi(argv[1]);
    const double alpha = 1.0;
    const double beta = 1.0;

    // Allocate n x n matrices, zero-initialized by calloc().
    double *A = static_cast<double *>(calloc(n * n, sizeof(double)));
    double *B = static_cast<double *>(calloc(n * n, sizeof(double)));
    double *C = static_cast<double *>(calloc(n * n, sizeof(double)));

    // C := alpha*A*B + beta*C (column-major, via the Fortran BLAS interface).
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    free(A);
    free(B);
    free(C);

    return 0;
}

I compile and run it on i7-6800K (6 physical cores):

g++ test.cpp libmkl_rt.so && time ./a.out 1024
real 62 ms

# add -fopenmp
g++ test.cpp -fopenmp libmkl_rt.so && time ./a.out 1024
real 137 ms

# set OMP_PLACES
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PLACES=cores ./a.out 1024
real 398 ms

# set OMP_PROC_BIND
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PROC_BIND=close ./a.out 1024
real 387 ms

# set OMP_PLACES and OMP_PROC_BIND
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PLACES=cores OMP_PROC_BIND=close ./a.out 1024
real 392 ms

and on Threadripper 1950X (16 physical cores):

g++ test.cpp libmkl_rt.so && time ./a.out 1024
real 85 ms

# add -fopenmp
g++ test.cpp -fopenmp libmkl_rt.so && time ./a.out 1024
real 390 ms

# set OMP_PLACES
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PLACES=cores ./a.out 1024
real 2915 ms

# set OMP_PROC_BIND
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PROC_BIND=close ./a.out 1024
real 3056 ms

# set OMP_PLACES and OMP_PROC_BIND
g++ test.cpp -fopenmp libmkl_rt.so && time OMP_PLACES=cores OMP_PROC_BIND=close ./a.out 1024
real 3014 ms

When linking this test program with BLIS I didn't get this behavior.

I'm going to ask on MKL forums and close this issue, but in the meantime would anyone here have any idea what's going on?

fgvanzee commented 4 years ago

@gmargari I'm very glad to hear it was not a problem with BLIS. While it might sound strange to some people, linking to both BLIS and MKL in the same executable is something that we do routinely in our standalone test drivers, and I can't say I've ever observed this behavior before. That said, I never try to set the OMP_PLACES or OMP_PROC_BIND variables when I do so; I only set MKL_NUM_THREADS. I also link against MKL differently, via -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lpthread -lm -ldl -fopenmp (for multithreading) and -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm -ldl (for sequential).

gmargari commented 4 years ago

Before closing this, was anyone else able to reproduce this issue with env vars?

If not, I should probably research it more on our machines (e.g. GOMP versions, etc.). If yes, maybe we should put a notification or something in the Multithreading documentation, because that is why I tried these env vars in the first place.

devinamatthews commented 4 years ago

@gmargari I added a note to the documentation.