Open bgeneto opened 6 years ago
What CPU, which version of OpenBLAS, what matrix size(s)? Does limiting OPENBLAS_NUM_THREADS further (even to just 2) improve performance? MKL may simply have a more efficient implementation of ZHEEV than the one from the netlib reference implementation of LAPACK that OpenBLAS uses, or may be better at choosing the appropriate number of threads for your problem size in a BLAS call from some part of ZHEEV. (Can you tell where in ZHEEV the time is spent, or can you provide a code sample that shows the problem? At a quick glance, netlib ZHEEV calls at least ZHETRD, ZSTEQR and either DSTERF or ZUNGTR, and those four will in turn call other routines...)
I've tested with various CPU families (mostly Nehalem, but also AMD Ryzen/Threadripper). The above-mentioned symptom (too much time spent in kernel calls) happens on every tested system. In fact OPENBLAS_NUM_THREADS=1 has better performance than any other number of threads; problem size is 200x200 or 300x300. When running with only one thread, CPU time is 100% green (user time in htop); with two or more threads the first thread is 100% green and the other ones are mostly red (kernel time), resulting in worse performance. I would like to understand why this is happening and how to avoid it (while using multiple threads). Maybe an OpenMP build of OpenBLAS has better parallelism in this particular case? I can provide a quick example/Fortran source later...
kernel time on the "other" threads is probably spent in sched_yield() - either waiting on a lock, or simply waiting for something to do. Which version of OpenBLAS are you using - 0.2.20 or a snapshot of the develop branch ? (The latter has a changed GEMM multithreading which may help)
One can experiment with the YIELDING macro. No idea why, but sched_yield there spins the CPU in the kernel to 100%, while having a noop there leaves the CPUs nearly idle at no penalty to overall time. I don't remember the past issue around it.
Issue #900, but previous experiments there have been quite inconclusive. I would not exclude the possibility that processes spending their time in YIELDING (for whatever implementation of that) is just a symptom and not the issue itself.
I suspect sched_yield became CPU hog at some point, but what it hogs otherwise would go unused...
I wonder if it is same observation as #1544
The observation from #1544 is not quite clear yet, and ARMV7 already has nop instead of sched_yield.
I don't know if the "problem" is related to sched_yield(), I'm afraid I don't have the right tools to check... So instead I provide the example code below so the experts here can profile/debug :-)
Thank you for the sample. 1) The official doc shortlists BLAS functions that may have wrong multiprocessing thresholds (topmost diagram, four-o'clock corner). 2) sched_yield eats a lot, but experimenting to eliminate it did not give conclusive improvement.
This may in part be a LAPACK issue: recent LAPACK includes an alternative, OpenMP-parallelized version of ZHEEV called ZHEEV_2STAGE that may show better performance (have not gotten around to trying it with your example yet, sorry). On the BLAS side, it seems interface/zaxpy.c did not receive the same ("temporary") fix for inefficient multithreading of small problem sizes that interface/axpy.c did (7 years ago, for issue #27). Not sure yet if that is related either...
According to perf, most of the time (on Kaby Lake mobile hardware at least) appears to be spent in zlasr, with zaxpy playing a minor role (though indeed doing needless multithreading for small sizes). zgemm seems to be more prominent, though from #1320 its behaviour should be quite good already.
The problem is that threads that are spun up but doing nothing are not accounted for in perf; they land as yielding instead. What about adding another thread only when the previous ones more or less fill the L3 cache with input + temporaries + output? I know the L3 is shared and repartitioned between cores/clusters/whatever on modern CPUs, but it would at least be a good first approximation. I guess at a few cores axpy will saturate memory bandwidth anyway.
Preliminary - changing sched_yield to nop does not directly affect running time, but gets rid of busy waiting that would drive cpu temperature (possibly leading to thermal throttling on poorly designed hardware). Dropping zaxpy to single threading is the only change that leads to a small speedup, while changing the thresholds for multithreading in zgemm, zhemv only reduces performance. As noted above, the majority of the time is spent in unoptimized LAPACK zlasr - for which MKL probably uses a better algorithm than the reference implementation. Also most of the lock/unlock cycles spent in the testcase appear to be from libc's random() used to fill the input matrix. (I ran the testcase 1000 times in a loop to get somewhat better data, but still the ratio between times spent in setup and actual calculation is a bit poor - which probably also explains the huge overhead from creating threads that are hardly needed afterwards.) Probably need to rewrite the testcase first when I find time for this again.
http://www.cs.utexas.edu/users/flame/pubs/flawn60.pdf contains a discussion of the fundamental reasons for the low performance of the zlasr function, and of alternative implementations.
In view of the discussion in #1614, you could try uncommenting the THREAD_TIMEOUT option in Makefile.rule and setting its value to 20 before recompiling, to see if it makes a difference.
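For reference, after uncommenting, the relevant line in Makefile.rule would look something like this (20 is the value suggested above; this is a build-time option, so a rebuild is required):

```make
# In Makefile.rule: idle-thread spin timeout exponent (threads wait
# 2^THREAD_TIMEOUT ticks for new work before sleeping)
THREAD_TIMEOUT = 20
```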
You should see some speedup and much less overhead with a current "develop" snapshot now (see #1624). Unfortunately this does not change the low performance of ZLASR itself, and I have now found that the new ZHEEV_2STAGE implementation I suggested earlier does not yet support the JOBZ=V case, i.e. computation of eigenvectors. (The reason for this is not clear to me, the code seems to be in place but is prevented from being called)
Quoting @fenrus75 from #1614:
"the sad part is that glibc has a flag you can set on the pthread locks etc. that makes glibc spin an appropriate amount of time, so normally apps then don't have to do their own spinning on top ;-) 100 msec is forever for spinning, though. The other sad part is that a sched_yield() is approximately as expensive as just waking up from a cond_wait() (at least in terms of order of magnitude and the work they do in the process scheduler)."
I ended up debugging the same thing on a 24-core Opteron today and came to the same conclusion. Could the THREAD_TIMEOUT maybe be made much smaller? There is probably little harm in spinning for a few microseconds. I tried:
diff --git a/driver/others/blas_server.c b/driver/others/blas_server.c
index 6f4e2061..0b074646 100644
--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -143,7 +143,7 @@ typedef struct {
static thread_status_t thread_status[MAX_CPU_NUMBER] __attribute__((aligned(ATTRIBUTE_SIZE)));
#ifndef THREAD_TIMEOUT
-#define THREAD_TIMEOUT 28
+#define THREAD_TIMEOUT 10
#endif
static unsigned int thread_timeout = (1U << (THREAD_TIMEOUT));
which drastically reduced the number of CPU cycles spent on a simple test case (note that the actual timeout is 1 << THREAD_TIMEOUT, so this changes the spin budget from about 268 million iterations to 1024):
24 threads, THREAD_TIMEOUT 10
real 0m46.798s
user 2m5.579s
sys 0m52.336s
24 threads, THREAD_TIMEOUT 28
real 0m47.692s
user 6m27.935s
sys 9m15.834s
single-threaded
real 0m39.774s
user 0m38.020s
sys 0m1.653s
It's probably possible to tune this better, but that simple change would be a good start if it shows no regressions in other tests.
Good point - note that THREAD_TIMEOUT can be overridden in Makefile.rule already, so there is no need to hack the actual code (as long as you are building with make - this option is not yet available in cmake builds).
Could you share the test case? Being slower with SMP is a regression on its own. Another issue is that sched_yield (aka the YIELDING macro) is used in a busy loop; some nanosleep could do better instead.
Since this issue still affects many libraries and software packages that rely on OpenBLAS, I've created a minimal example file showing the issue. Now that Intel oneAPI is easily available for Linux/WSL2, you can compare the performance of the two subroutines (zheev and zheevr) with ifort+MKL and gfortran+OpenBLAS. You will see that the MKL version is not affected by this bug in zheev (or whatever function it calls). Unfortunately, THREAD_TIMEOUT minimizes the problem but doesn't solve it, even when using few threads (four) and a relatively large matrix.
Could you check threaded and non-threaded OpenBLAS with perf record ; perf report to see which syscall is performed in excess? On native Linux, that is - certainly not WSL or Xen.
Just for reference, timings for current develop on a 6c/12t AMD Ryzen 5 4600H running Linux:
Intel: ZHEEVR 22.231s, ZHEEV 64.898s (and no positive effect from setting MKL_DEBUG_CPU_TYPE=5)
GCC: ZHEEVR 16.310s, ZHEEV 166.612s
gcc + openblas ZHEEVR took: 10.982 seconds (100%) ZHEEV took: 133.658 seconds (1117%)
intel + mkl ZHEEVR took: 10.830 seconds (100%) ZHEEV took: 12.454 seconds (115%)
Relevant perf tool report for gcc+openblas:
Overhead Command Shared Object Symbol
67,60% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zlasr_
17,18% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zhemv_U
7,11% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_kernel_r
3,67% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_kernel_l
1,49% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_incopy
0,47% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] dlaneg_
0,31% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zlar1v_
0,18% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zlarfb_
0,16% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] blas_thread_server
0,15% hermitianEigen libm.so.6 [.] hypot
0,14% hermitianEigen libc.so.6 [.] __sched_yield
0,13% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemv_kernel_4x4
0,12% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zcopy_k
0,11% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_itcopy
0,10% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemv_kernel_4x4
0,10% hermitianEigen [unknown] [k] 0xffffffff87e00158
0,09% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] dlasq5_
0,06% hermitianEigen hermitianEigen [.] MAIN__
0,06% hermitianEigen libgfortran.so.5.0.0 [.] _gfortran_arandom_r8
0,06% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] lsame_
0,06% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] dlartg_
0,05% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] dlamch_
0,04% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_otcopy
0,04% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zsteqr_
0,03% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zaxpy_kernel_4
0,03% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] ztrmm_kernel_RR
0,03% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] ztrmm_kernel_RC
0,03% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] ztrmm_kernel_RN
Relevant perf tool report for intel+mkl:
Overhead Command Shared Object Symbol
42,57% hermitianEigen- libmkl_avx2.so.2 [.] mkl_lapack_ps_avx2_zhemv_nb
29,34% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_zgemm_kernel_0
8,55% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dgemm_kernel_0
4,05% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
2,43% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dtrmm_kernel_rl_0
2,07% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dtrmm_kernel_ru_0
2,06% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_xdlacpy
1,74% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_zgemm_zccopy_right6_ea
1,19% hermitianEigen- hermitianEigen-ifort [.] for__acquire_semaphore_threaded
1,15% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_dlaneg
0,94% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_zlar1v
0,35% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_xzgemv
0,32% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dgemm_kernel_0_b0
0,26% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_dlaq6
0,25% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_zgemm_zcopy_down6_ea
0,25% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_xdrotm
0,22% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_zlarfb
0,20% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_zgemm_zcopy_right2_ea
0,20% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_dlasq5
0,16% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dgemm_kernel_nocopy_NN_b1
0,15% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_xzcopy
0,15% hermitianEigen- hermitianEigen-ifort [.] MAIN__
0,12% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_ztrmm_kernel_ru_0
0,12% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_zgemm_zccopy_down2_ea
0,08% hermitianEigen- hermitianEigen-ifort [.] for_simd_random_number
0,08% hermitianEigen- libmkl_avx2.so.2 [.] mkl_blas_avx2_dgemm_dcopy_right4_ea
0,07% hermitianEigen- libmkl_intel_thread.so.2 [.] mkl_lapack_dlasr3
0,05% hermitianEigen- libmkl_intel_thread.so.2 [.] mkl_lapack_zlatrd
USE_THREAD=0
gcc + openblas (single-threading) ZHEEVR took: 22.923 seconds ZHEEV took: 139.490 seconds (+508%)
Relevant perf tool report for gcc+openblas (single-threading):
Overhead Command Shared Object Symbol
76,16% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zlasr_
9,90% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zhemv_U
6,77% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_kernel_r
3,48% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_kernel_l
1,05% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_incopy
0,54% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlaneg_
0,36% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zlar1v_
0,19% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zlarfb_
0,18% hermitianEigen- libm.so.6 [.] hypot
0,13% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zcopy_k
0,12% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemv_kernel_4x4
0,11% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemv_kernel_4x4
0,10% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlasq5_
0,07% hermitianEigen- hermitianEigen-nt [.] MAIN__
0,07% hermitianEigen- libgfortran.so.5.0.0 [.] _gfortran_arandom_r8
0,07% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] lsame_
0,07% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlartg_
0,06% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_itcopy
0,06% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlamch_
0,04% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_otcopy
0,04% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zsteqr_
0,03% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] ztrmm_kernel_RC
0,03% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] ztrmm_kernel_RN
0,03% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] ztrmm_kernel_RR
0,03% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zaxpy_kernel_4
0,03% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] __powidf2
0,02% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] lsame_@plt
0,02% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlarrf_
gcc + openblas ZHEEV took: 161.779 seconds
intel + mkl (faster on AMD, evidence of zheev/zlasr openblas issue) ZHEEV took: 41.201 seconds
Relevant perf tool report for gcc+openblas:
Overhead Command Shared Object Symbol
81,37% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zlasr_
8,77% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zhemv_U
5,59% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_kernel_r
1,86% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_kernel_l
0,58% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_incopy
0,18% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zcopy_k
0,13% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemv_kernel_4x4
0,11% hermitianEigen hermitianEigen [.] MAIN__
0,10% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemv_kernel_4x4
0,09% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zlarfb_
0,08% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] dlartg_
0,07% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_itcopy
0,07% hermitianEigen libopenblas_haswellp-r0.3.20.so [.] zgemm_otcopy
0,05% hermitianEigen libgfortran.so.5.0.0 [.] _gfortran_arandom_r8
0,05% hermitianEigen libm.so.6 [.] hypot
Relevant perf tool report for gcc+openblas-st (single-threading with USE_THREAD=0
):
Overhead Command Shared Object Symbol
85,90% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zlasr_
5,34% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zhemv_U
5,31% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_kernel_r
1,64% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_kernel_l
0,51% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemm_incopy
0,18% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zcopy_k
0,10% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zlarfb_
0,09% hermitianEigen- hermitianEigen-nt [.] MAIN__
0,08% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] dlartg_
0,08% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemv_kernel_4x4
0,07% hermitianEigen- libopenblas_non-threaded_haswell-r0.3.20.so [.] zgemv_kernel_4x4
0,06% hermitianEigen- libm.so.6 [.] hypot
Relevant perf tool report for intel+mkl:
Overhead Command Shared Object Symbol
33,74% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dgemm_pst
14,14% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_zgemm_kernel_0_zen
13,79% hermitianEigen- libmkl_def.so.2 [.] mkl_lapack_ps_def_zhemv_nb
13,48% hermitianEigen- libiomp5.so [.] _INTERNAL92a63c0c::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true>
7,92% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dgemm_kernel_zen
5,63% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dgemm_copyan_bdz
3,22% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_zgemm_zccopy_right4_bdz
1,41% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_xdlacpy
1,18% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dtrmm_inn
0,97% hermitianEigen- hermitianEigen-ifort [.] for_simd_random_number
0,47% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_zgemm_zccopy_down2_bdz
0,46% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_zgemm_zcopy_down4_bdz
0,38% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dgemm_copybn_bdz
0,35% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_xdrot
0,29% hermitianEigen- libiomp5.so [.] kmp_flag_native<unsigned long long, (flag_type)1, true>::notdone_check
0,23% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_xzcopy
0,12% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_dlaq6
0,12% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_dgemm_mscale
0,12% hermitianEigen- libiomp5.so [.] _INTERNAL92a63c0c::__kmp_hyper_barrier_gather
0,10% hermitianEigen- hermitianEigen-ifort [.] MAIN__
0,10% hermitianEigen- libmkl_def.so.2 [.] mkl_blas_def_ztrmrc
0,10% hermitianEigen- libmkl_core.so.2 [.] mkl_lapack_zlarfb
Looks like MKL may "simply" be using a different implementation of ZHEEV that avoids the expensive call to ZLASR. (Remember that almost all the LAPACK in OpenBLAS is a direct copy of https://github.com/Reference-LAPACK/lapack a.k.a "netlib" - unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018)
"unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018"
That's really unfortunate! Maybe we should report it elsewhere, @martin-frbg (is there any netlib LAPACK forum?).
See the link for their GitHub issue tracker. The old forum is archived at https://icl.utk.edu/lapack-forum/ and has since been replaced by a Google group at https://groups.google.com/a/icl.utk.edu/g/lapack
It would probably make sense to rerun your test with pure "netlib" LAPACK and BLAS before reporting there, though.
gfortran using unoptimized (and single-threaded) Reference-LAPACK (and associated BLAS) on same AMD hardware: ZHEEVR 129.210s ZHEEV 293.950s
Ok, since netlib's LAPACK exhibits the same behavior, it's really not a bug in OpenBLAS but a performance issue with LAPACK's ZLASR.
It seems that the only consistently performing routines for the complex Hermitian eigenproblem (when requesting all eigenpairs) in LAPACK/OpenBLAS are those using Relatively Robust Representations, which means only ?HEEVR (CHEEVR/ZHEEVR)... at least until the work on ZHEEV_2STAGE, ZHEEVD_2STAGE, and ZHEEVR_2STAGE gets done/ported in LAPACK (unfortunately, those 2STAGE routines can currently compute only eigenvalues, not eigenvectors).
Additionally, do yourself a favor and explicitly set driver='evr' when using Python numpy/scipy with OpenBLAS (otherwise the comparison with MKL is unfair due to the issue reported here):
from scipy import linalg as LA
...
w, v = LA.eigh(mat, driver='evr')
Agreed. Unfortunately there is no sign of ongoing work on the 2stage codes since their inclusion. The FLAME group paper I linked above four years ago at least sketches a much faster implementation of what zlasr does, but actually coding it looks non-trivial
Created a LAPACK ticket to inquire about implementation status.
Hi!
While comparing OpenBLAS performance with Intel MKL, I've noticed that (at least in my particular case: a real symmetric or complex Hermitian eigenvalue problem, e.g. ZHEEV) OpenBLAS spends much more kernel time (red bars in htop) than Intel MKL, and maybe this is why it is so slow (three to five times slower, depending on matrix size) compared to MKL. Does anybody know what is causing so much kernel thread time and how to avoid it? I've already limited OPENBLAS_NUM_THREADS to 4 or 8... TIA.