OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Above average kernel times causing slow performance #1560

Open bgeneto opened 6 years ago

bgeneto commented 6 years ago

Hi!

While comparing OpenBLAS performance with Intel MKL I've noticed that (at least in my particular case: a real symmetric or Hermitian eigenvalue problem, e.g. ZHEEV) OpenBLAS spends far more time in kernel mode (red bars in htop) than Intel MKL, and maybe this is why it is so slow compared to MKL (three to five times slower, depending on matrix size). Does anybody know what is causing so much kernel thread time and how to avoid it? I've already limited OPENBLAS_NUM_THREADS to 4 or 8... TIA.
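
For reference, the thread count can also be capped from the calling program instead of via the environment variable; a minimal sketch, assuming a threaded OpenBLAS build and the openblas_set_num_threads API declared in cblas.h:

/* Minimal sketch: limit OpenBLAS to one thread at runtime, roughly
 * equivalent to exporting OPENBLAS_NUM_THREADS=1 before starting the
 * program. Assumes linking against a threaded libopenblas. */
#include <cblas.h>

int main(void)
{
    openblas_set_num_threads(1);   /* avoid the busy-waiting worker threads */
    /* ... call ZHEEV / other LAPACK routines here ... */
    return 0;
}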

martin-frbg commented 6 years ago

What CPU, which version of OpenBLAS, what matrix size(s)? Does limiting OPENBLAS_NUM_THREADS further (even to just 2) improve performance? MKL may simply have a more efficient implementation of ZHEEV than the one from the netlib reference implementation of LAPACK that OpenBLAS uses, or may be better at choosing the appropriate number of threads for your problem size in a BLAS call from some part of ZHEEV. (Can you tell where in ZHEEV the time is spent, or can you provide a code sample that shows the problem? At a quick glance, netlib ZHEEV calls at least ZHETRD, ZSTEQR and either DSTERF or ZUNGTR, and those four will in turn call other routines...)

bgeneto commented 6 years ago

I've tested with various CPU families (mostly Nehalem, but also AMD Ryzen/Threadripper). The above-mentioned behaviour (too much time spent in kernel calls) happens on every tested system. In fact OPENBLAS_NUM_THREADS=1 gives better performance than any other number of threads; the problem size is 200x200 or 300x300. When running with only one thread the CPU time is 100% green; with two or more threads the first thread is 100% green and the other ones are mostly red (kernel time), which results in worse performance. I would like to understand why this is happening and how to avoid it (while using multiple threads). Maybe an OpenMP build of OpenBLAS has better parallelism in this particular case? I can provide a quick example/Fortran source later...

martin-frbg commented 6 years ago

kernel time on the "other" threads is probably spent in sched_yield() - either waiting on a lock, or simply waiting for something to do. Which version of OpenBLAS are you using - 0.2.20 or a snapshot of the develop branch ? (The latter has a changed GEMM multithreading which may help)

brada4 commented 6 years ago

One can experiment with the YIELDING macro. No idea why, but sched_yield there spins the CPU in the kernel to 100%, while having a noop there leaves the CPUs nearly idle at no penalty to the overall time. I don't remember the past issue around it.
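
For context, a simplified sketch of the alternative being discussed (not the literal common.h source; the FORCE_NOP_YIELD switch is only illustrative):

#include <sched.h>

#ifndef FORCE_NOP_YIELD
/* default: ask the kernel to reschedule; this syscall is what shows up as
 * red/kernel time in htop while idle worker threads wait for work */
#define YIELDING sched_yield()
#else
/* experiment: spin in user space with a few nops instead of entering the kernel */
#define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n")
#endif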

martin-frbg commented 6 years ago

Issue #900, but previous experiments there have been quite inconclusive. I would not exclude the possibility that processes spending their time in YIELDING (for whatever implementation of that) is just a symptom and not the issue itself.

brada4 commented 6 years ago

I suspect sched_yield became a CPU hog at some point, but what it hogs would otherwise go unused...

brada4 commented 6 years ago

I wonder if it is the same observation as #1544.

martin-frbg commented 6 years ago

The observation from #1544 is not quite clear yet, and ARMV7 already uses a nop instead of sched_yield.

bgeneto commented 6 years ago

I don't know if the "problem" is related to sched_yield(), I'm afraid I don't have the right tools to check... So instead I provide the example code below so the experts here can profile/debug :-)

zheev-example

brada4 commented 6 years ago

Thank you for the sample. 1) The official documentation shortlists the BLAS functions that may have wrong multiprocessing thresholds (topmost diagram, four-o'clock corner). 2) sched_yield eats a lot of time, but experiments to eliminate it did not give a conclusive improvement.

martin-frbg commented 6 years ago

This may in part be a LAPACK issue, recent LAPACK includes an alternative, OpenMP-parallelized version of ZHEEV called ZHEEV_2STAGE that may show better performance (have not gotten around to trying with your example yet, sorry). On the BLAS side, it seems interface/zaxpy.c did not receive the same ("temporary") fix for inefficient multithreading of small problem sizes as interface/axpy.c did (7 years ago, for issue #27). Not sure yet if that is related either...
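
For illustration, the "temporary" fix in interface/axpy.c amounts to a size cutoff along these lines (a hedged sketch; the helper name and the exact threshold are illustrative, not the literal source):

#include <stddef.h>

/* Hedged sketch of the issue #27 work-around in interface/axpy.c that
 * interface/zaxpy.c lacks: fall back to a single thread when the vector
 * is too small for the threading overhead to pay off. */
static int axpy_thread_count(size_t n, int available_threads,
                             long incx, long incy)
{
    if (incx == 0 || incy == 0)  /* overlapping updates cannot be split safely */
        return 1;
    if (n <= 10000)              /* too little work to amortize waking the workers */
        return 1;
    return available_threads;
}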

martin-frbg commented 6 years ago

According to perf, most of the time (on Kaby Lake mobile hardware at least) appears to be spent in zlasr, with zaxpy playing a minor role (but indeed doing needless multithreading for small sizes). zgemm seems to be more prominent, though from #1320 its behaviour should be quite good already.

brada4 commented 6 years ago

The problem is that threads that are spun up but doing nothing are not accounted for in perf; they show up as yielding instead. What about adding another thread only when the previous one more or less fills the L3 cache with input + temporaries + output? I know the cache is shared and repartitioned between cores/clusters on modern CPUs, but it would at least be a good first approximation. I guess with a few cores axpy will saturate memory bandwidth anyway.

martin-frbg commented 6 years ago

Preliminary - changing sched_yield to nop does not directly affect running time, but it gets rid of the busy waiting that would drive up CPU temperature (possibly leading to thermal throttling on poorly designed hardware). Dropping zaxpy to single threading is the only change that leads to a small speedup, while changing the thresholds for multithreading in zgemm and zhemv only reduces performance. As noted above, the majority of the time is spent in the unoptimized LAPACK zlasr - for which MKL probably uses a better algorithm than the reference implementation. Also, most of the lock/unlock cycles spent in the testcase appear to come from libc's random() used to fill the input matrix. (I ran the testcase 1000 times in a loop to get somewhat better data, but the ratio between time spent in setup and actual calculation is still a bit poor - which probably also explains the huge overhead from creating threads that are hardly needed afterwards.) I probably need to rewrite the testcase first when I find time for this again.

martin-frbg commented 6 years ago

http://www.cs.utexas.edu/users/flame/pubs/flawn60.pdf contains a discussion of the fundamental reasons for the low performance of the zlasr function, and of alternative implementations.

martin-frbg commented 6 years ago

In view of the discussion in #1614, you could try if uncommenting the THREAD_TIMEOUT option in Makefile.rule and setting its value to 20 before recompiling makes a difference.

martin-frbg commented 6 years ago

You should see some speedup and much less overhead with a current "develop" snapshot now (see #1624). Unfortunately this does not change the low performance of ZLASR itself, and I have now found that the new ZHEEV_2STAGE implementation I suggested earlier does not yet support the JOBZ=V case, i.e. computation of eigenvectors. (The reason for this is not clear to me, the code seems to be in place but is prevented from being called)

arndb commented 4 years ago

Quoting @fenrus75 from #1614:

the sad part is that glibc has a flag you can set on the pthread locks/etc that makes glibc spin an appropriate amount of time normally apps then don't have to do their own spinning on top ;-) 100 msec is forever for spinning though. The other sad part is that a sched_yield() is approximately as expensive as just waking up from a cond_wait() (at least in terms of order of magnitude and the work they do in the process scheduler)

I ended up debugging the same thing on a 24 core Opteron today and came to the same conclusion. Could the THREAD_TIMEOUT maybe be made much smaller? There is probably little harm in spinning a few microseconds. I tried


diff --git a/driver/others/blas_server.c b/driver/others/blas_server.c
index 6f4e2061..0b074646 100644
--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -143,7 +143,7 @@ typedef struct {
 static thread_status_t thread_status[MAX_CPU_NUMBER] __attribute__((aligned(ATTRIBUTE_SIZE)));

 #ifndef THREAD_TIMEOUT
-#define THREAD_TIMEOUT 28
+#define THREAD_TIMEOUT 10
 #endif

 static unsigned int thread_timeout = (1U << (THREAD_TIMEOUT));

which drastically reduced the number of CPU cycles spent on a simple test case:

24 threads, THREAD_TIMEOUT 10

real    0m46.798s
user    2m5.579s
sys     0m52.336s

24 threads, THREAD_TIMEOUT 28

real    0m47.692s
user    6m27.935s
sys     9m15.834s

single-threaded

real    0m39.774s
user    0m38.020s
sys     0m1.653s

It's probably possible to tune this better, but that simple change would be a good start if it shows no regressions in other tests.
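
For readers unfamiliar with the constant: each idle server thread spins for roughly 2^THREAD_TIMEOUT iterations of the YIELDING macro before it finally blocks on a condition variable, so THREAD_TIMEOUT 28 means hundreds of millions of sched_yield calls per idle period. A simplified sketch of that pattern (not the literal blas_server.c code):

#include <pthread.h>
#include <sched.h>

/* Simplified sketch: spin for a while hoping new work arrives immediately,
 * then give up and sleep on a condition variable until woken. */
static void wait_for_work(void * volatile *queue, pthread_mutex_t *lock,
                          pthread_cond_t *wakeup, unsigned long spin_limit)
{
    unsigned long i;

    for (i = 0; i < spin_limit && *queue == NULL; i++)
        sched_yield();                    /* the YIELDING macro by default */

    pthread_mutex_lock(lock);
    while (*queue == NULL)
        pthread_cond_wait(wakeup, lock);  /* woken when new work is queued */
    pthread_mutex_unlock(lock);
}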

martin-frbg commented 4 years ago

Good point - note that THREAD_TIMEOUT can already be overridden in Makefile.rule, so there is no need to hack the actual code (as long as you are building with make - this option is not yet available in cmake builds).

brada4 commented 4 years ago

Could you share the test case? Being slower with SMP is a regression on its own.

Another issue is that sched_yield (aka the YIELDING macro) is used in a busy loop; some nanosleep could do better instead.

bgeneto commented 2 years ago

Since this issue still affects many libraries/programs that rely on OpenBLAS, I've created a minimal example file showing the issue. Now that Intel oneAPI is easily available for Linux/WSL2, you can compare the performance of the two subroutines (zheev and zheevr) with ifort+mkl and gfortran+openblas. You will see that the MKL version is not affected by this bug in zheev (or whatever function it calls). Unfortunately THREAD_TIMEOUT minimizes the problem but does not solve it, even when using few threads (four) and a relatively large matrix.
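
For anyone who wants a quick non-Fortran reproduction, here is a rough C analogue of that comparison (only a sketch: it assumes an OpenBLAS build with the LAPACKE interface, the default C99 lapack_complex_double, and an arbitrary matrix size; return codes are not checked):

#include <complex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lapacke.h>

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const lapack_int n = 2000;            /* illustrative size */
    lapack_complex_double *a  = malloc((size_t)n * n * sizeof *a);
    lapack_complex_double *a2 = malloc((size_t)n * n * sizeof *a2);
    lapack_complex_double *z  = malloc((size_t)n * n * sizeof *z);
    double *w = malloc((size_t)n * sizeof *w);
    lapack_int *isuppz = malloc(2 * (size_t)n * sizeof *isuppz);
    lapack_int m;

    /* build a random Hermitian matrix: a(j,i) = conj(a(i,j)), real diagonal */
    for (lapack_int i = 0; i < n; i++)
        for (lapack_int j = 0; j <= i; j++) {
            double re = rand() / (double)RAND_MAX;
            double im = (i == j) ? 0.0 : rand() / (double)RAND_MAX;
            a[i * n + j] = re + im * I;
            a[j * n + i] = re - im * I;
        }
    memcpy(a2, a, (size_t)n * n * sizeof *a);

    double t = wall_seconds();
    LAPACKE_zheev(LAPACK_COL_MAJOR, 'V', 'U', n, a, n, w);
    printf("ZHEEV  took %.3f s\n", wall_seconds() - t);

    t = wall_seconds();
    LAPACKE_zheevr(LAPACK_COL_MAJOR, 'V', 'A', 'U', n, a2, n,
                   0.0, 0.0, 0, 0, 0.0, &m, w, z, n, isuppz);
    printf("ZHEEVR took %.3f s\n", wall_seconds() - t);

    free(a); free(a2); free(z); free(w); free(isuppz);
    return 0;
}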

brada4 commented 2 years ago

Could you check with threaded and non-threaded OpenBLAS, using perf record ; perf report, to see which syscall is performed in excess? On native Linux, certainly not WSL or XEN.

martin-frbg commented 2 years ago

Just for reference, timings for current develop on a 6c/12t AMD Ryzen 5 4600H running Linux (seconds):

Intel: ZHEEVR 22.231, ZHEEV 64.898 (and no positive effect from setting MKL_DEBUG_CPU_TYPE=5)
GCC: ZHEEVR 16.310, ZHEEV 166.612

bgeneto commented 2 years ago

gcc + openblas: ZHEEVR took 10.982 seconds (100%), ZHEEV took 133.658 seconds (1117%)

intel + mkl: ZHEEVR took 10.830 seconds (100%), ZHEEV took 12.454 seconds (115%)

Relevant perf tool report for gcc+openblas:

Overhead  Command         Shared Object                    Symbol
  67,60%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zlasr_
  17,18%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zhemv_U
   7,11%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_kernel_r
   3,67%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_kernel_l
   1,49%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_incopy
   0,47%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] dlaneg_
   0,31%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zlar1v_
   0,18%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zlarfb_
   0,16%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] blas_thread_server
   0,15%  hermitianEigen  libm.so.6                        [.] hypot
   0,14%  hermitianEigen  libc.so.6                        [.] __sched_yield
   0,13%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemv_kernel_4x4
   0,12%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zcopy_k
   0,11%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_itcopy
   0,10%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemv_kernel_4x4
   0,10%  hermitianEigen  [unknown]                        [k] 0xffffffff87e00158
   0,09%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] dlasq5_
   0,06%  hermitianEigen  hermitianEigen                   [.] MAIN__
   0,06%  hermitianEigen  libgfortran.so.5.0.0             [.] _gfortran_arandom_r8
   0,06%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] lsame_
   0,06%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] dlartg_
   0,05%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] dlamch_
   0,04%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_otcopy
   0,04%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zsteqr_
   0,03%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zaxpy_kernel_4
   0,03%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] ztrmm_kernel_RR
   0,03%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] ztrmm_kernel_RC
   0,03%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] ztrmm_kernel_RN   

Relevant perf tool report for intel+mkl:

Overhead  Command          Shared Object             Symbol
  42,57%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_lapack_ps_avx2_zhemv_nb
  29,34%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_zgemm_kernel_0
   8,55%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dgemm_kernel_0
   4,05%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
   2,43%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dtrmm_kernel_rl_0
   2,07%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dtrmm_kernel_ru_0
   2,06%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_xdlacpy
   1,74%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_zgemm_zccopy_right6_ea
   1,19%  hermitianEigen-  hermitianEigen-ifort      [.] for__acquire_semaphore_threaded
   1,15%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_dlaneg
   0,94%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_zlar1v
   0,35%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_xzgemv
   0,32%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dgemm_kernel_0_b0
   0,26%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_dlaq6
   0,25%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_zgemm_zcopy_down6_ea
   0,25%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_xdrotm
   0,22%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_zlarfb
   0,20%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_zgemm_zcopy_right2_ea
   0,20%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_dlasq5
   0,16%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dgemm_kernel_nocopy_NN_b1
   0,15%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_xzcopy
   0,15%  hermitianEigen-  hermitianEigen-ifort      [.] MAIN__
   0,12%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_ztrmm_kernel_ru_0
   0,12%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_zgemm_zccopy_down2_ea
   0,08%  hermitianEigen-  hermitianEigen-ifort      [.] for_simd_random_number
   0,08%  hermitianEigen-  libmkl_avx2.so.2          [.] mkl_blas_avx2_dgemm_dcopy_right4_ea
   0,07%  hermitianEigen-  libmkl_intel_thread.so.2  [.] mkl_lapack_dlasr3
   0,05%  hermitianEigen-  libmkl_intel_thread.so.2  [.] mkl_lapack_zlatrd

gcc + openblas (single-threaded): ZHEEVR took 22.923 seconds, ZHEEV took 139.490 seconds (+508%)

Relevant perf tool report for gcc+openblas (single-threading):

Overhead  Command          Shared Object                                Symbol
  76,16%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zlasr_
   9,90%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zhemv_U
   6,77%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_kernel_r
   3,48%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_kernel_l
   1,05%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_incopy
   0,54%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlaneg_
   0,36%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zlar1v_
   0,19%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zlarfb_
   0,18%  hermitianEigen-  libm.so.6                                    [.] hypot
   0,13%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zcopy_k
   0,12%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemv_kernel_4x4
   0,11%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemv_kernel_4x4
   0,10%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlasq5_
   0,07%  hermitianEigen-  hermitianEigen-nt                            [.] MAIN__
   0,07%  hermitianEigen-  libgfortran.so.5.0.0                         [.] _gfortran_arandom_r8
   0,07%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] lsame_
   0,07%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlartg_
   0,06%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_itcopy
   0,06%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlamch_
   0,04%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_otcopy
   0,04%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zsteqr_
   0,03%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] ztrmm_kernel_RC
   0,03%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] ztrmm_kernel_RN
   0,03%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] ztrmm_kernel_RR
   0,03%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zaxpy_kernel_4
   0,03%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] __powidf2
   0,02%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] lsame_@plt
   0,02%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlarrf_

bgeneto commented 2 years ago

Report for AMD Ryzen 5 5600G running only ZHEEV

gcc + openblas ZHEEV took: 161.779 seconds

intel + mkl (faster on AMD, evidence of zheev/zlasr openblas issue) ZHEEV took: 41.201 seconds

Relevant perf tool report for gcc+openblas:

Overhead  Command         Shared Object                    Symbol
  81,37%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zlasr_
   8,77%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zhemv_U
   5,59%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_kernel_r
   1,86%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_kernel_l
   0,58%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_incopy
   0,18%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zcopy_k
   0,13%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemv_kernel_4x4
   0,11%  hermitianEigen  hermitianEigen                   [.] MAIN__
   0,10%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemv_kernel_4x4
   0,09%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zlarfb_
   0,08%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] dlartg_
   0,07%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_itcopy
   0,07%  hermitianEigen  libopenblas_haswellp-r0.3.20.so  [.] zgemm_otcopy
   0,05%  hermitianEigen  libgfortran.so.5.0.0             [.] _gfortran_arandom_r8
   0,05%  hermitianEigen  libm.so.6                        [.] hypot

Relevant perf tool report for gcc+openblas-st (single-threading with USE_THREAD=0):

Overhead  Command          Shared Object                                Symbol
  85,90%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zlasr_
   5,34%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zhemv_U
   5,31%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_kernel_r
   1,64%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_kernel_l
   0,51%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemm_incopy
   0,18%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zcopy_k
   0,10%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zlarfb_
   0,09%  hermitianEigen-  hermitianEigen-nt                            [.] MAIN__
   0,08%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] dlartg_
   0,08%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemv_kernel_4x4
   0,07%  hermitianEigen-  libopenblas_non-threaded_haswell-r0.3.20.so  [.] zgemv_kernel_4x4
   0,06%  hermitianEigen-  libm.so.6                                    [.] hypot

Relevant perf tool report for intel+mkl:

Overhead  Command          Shared Object             Symbol
  33,74%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dgemm_pst
  14,14%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_zgemm_kernel_0_zen
  13,79%  hermitianEigen-  libmkl_def.so.2           [.] mkl_lapack_ps_def_zhemv_nb
  13,48%  hermitianEigen-  libiomp5.so               [.] _INTERNAL92a63c0c::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true>
   7,92%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dgemm_kernel_zen
   5,63%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dgemm_copyan_bdz
   3,22%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_zgemm_zccopy_right4_bdz
   1,41%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_xdlacpy
   1,18%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dtrmm_inn
   0,97%  hermitianEigen-  hermitianEigen-ifort      [.] for_simd_random_number
   0,47%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_zgemm_zccopy_down2_bdz
   0,46%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_zgemm_zcopy_down4_bdz
   0,38%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dgemm_copybn_bdz
   0,35%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_xdrot
   0,29%  hermitianEigen-  libiomp5.so               [.] kmp_flag_native<unsigned long long, (flag_type)1, true>::notdone_check
   0,23%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_xzcopy
   0,12%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_dlaq6
   0,12%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_dgemm_mscale
   0,12%  hermitianEigen-  libiomp5.so               [.] _INTERNAL92a63c0c::__kmp_hyper_barrier_gather
   0,10%  hermitianEigen-  hermitianEigen-ifort      [.] MAIN__
   0,10%  hermitianEigen-  libmkl_def.so.2           [.] mkl_blas_def_ztrmrc
   0,10%  hermitianEigen-  libmkl_core.so.2          [.] mkl_lapack_zlarfb

martin-frbg commented 2 years ago

Looks like MKL may "simply" be using a different implementation of ZHEEV that avoids the expensive call to ZLASR. (Remember that almost all the LAPACK in OpenBLAS is a direct copy of https://github.com/Reference-LAPACK/lapack a.k.a "netlib" - unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018)

bgeneto commented 2 years ago

unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018)

That's really unfortunate! Maybe we should report elsewhere @martin-frbg (any netlib lapack forum?).

martin-frbg commented 2 years ago

See the link for their GitHub issue tracker. The old forum is archived at https://icl.utk.edu/lapack-forum/ and has since been replaced by a Google group at https://groups.google.com/a/icl.utk.edu/g/lapack

martin-frbg commented 2 years ago

It would probably make sense to rerun your test with pure "netlib" LAPACK and BLAS before reporting there, though.

martin-frbg commented 2 years ago

gfortran using unoptimized (and single-threaded) Reference-LAPACK (and its associated BLAS) on the same AMD hardware: ZHEEVR 129.210 s, ZHEEV 293.950 s

bgeneto commented 2 years ago

Ok, since netlib's lapack exhibits the same behavior, it's really not a bug in openblas but a performance issue with lapack's ZLASR.

It seems that the only consistently performing routines (performance-wise) for the complex Hermitian eigenproblem (when requesting all eigenpairs) in LAPACK/OpenBLAS are those using Relatively Robust Representations, which means only ?HEEVR (CHEEVR/ZHEEVR)... at least until the work on ZHEEV_2STAGE, ZHEEVD_2STAGE, and ZHEEVR_2STAGE gets finished/ported in LAPACK (at present those 2STAGE routines can compute only eigenvalues).

Additionally, do yourself a favor and explicitly set driver='evr' when using Python numpy/scipy with OpenBLAS (otherwise the comparison with MKL is unfair because of the issue reported here):

from scipy import linalg as LA
...
w, v = LA.eigh(mat, driver='evr')

martin-frbg commented 2 years ago

Agreed. Unfortunately there has been no sign of ongoing work on the 2stage codes since their inclusion. The FLAME group paper I linked above four years ago at least sketches a much faster implementation of what zlasr does, but actually coding it looks non-trivial.

martin-frbg commented 2 years ago

Created a LAPACK ticket to inquire about implementation status.