`cblas_sbgemm` extremely slow on WSL & AMD CPU

moderato commented 5 months ago

Hello, I'm trying to run cblas_sbgemm on WSL with an AMD CPU but find it extremely slow, e.g. 30x slower than cblas_sgemm. Does anyone know how to debug and solve this issue?

Related issue (run on Anaconda Prompt): https://github.com/OpenMathLib/OpenBLAS/issues/4672

System info:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 7640HS w/ Radeon 760M Graphics
    CPU family:          25
    Model:               116
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            8583.32
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
Virtualization features:
  Virtualization:        AMD-V
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    6 MiB (6 instances)
  L3:                    16 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
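
The Flags line above is the quickest way to confirm that the hypervisor exposes AVX512 and AVX512_BF16 to the WSL guest. As a rough sketch, the same check can be scripted against `/proc/cpuinfo`. The helper name `missing_flags` and the selection of flags in `WANTED` are assumptions made for illustration; this is not something OpenBLAS itself runs.

```python
from pathlib import Path

# Flags of interest for BF16 GEMM. This selection is an assumption made for
# illustration, using the names that lscpu and /proc/cpuinfo print.
WANTED = ("avx512f", "avx512bw", "avx512vl", "avx512_bf16")


def missing_flags(cpuinfo_text, wanted=WANTED):
    """Return the wanted flags that do not appear in a cpuinfo/lscpu dump."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "flags":
            present = set(value.split())
            return [flag for flag in wanted if flag not in present]
    # No "flags" line found: report everything as missing.
    return list(wanted)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("missing:", missing_flags(path.read_text()))
```

An empty list means all of the listed flags are visible inside the guest, which is what the lscpu output above suggests for this machine.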

Build command:

cmake -DBUILD_BFLOAT16=ON -DBUILD_WITHOUT_LAPACK=yes -DNOFORTRAN=1 ..
make -j16
make install

martin-frbg commented 5 months ago

It's not quite clear to me how you got around the internal compiler error mentioned in your other ticket. Did you get a library with "zen" in the name (meaning autodetection of the CPU type worked)? Or, if you did a DYNAMIC_ARCH build, what CPU type gets reported when you set OPENBLAS_VERBOSE=2 in the environment?

moderato commented 5 months ago

> it's not quite clear to me here how you got around the internal compiler error mentioned in your other ticket ? and did you get a library with "zen" in the name (meaning autodetection of cpu type worked) - or if you did a DYNAMIC_ARCH build, what cpu type gets reported when you set OPENBLAS_VERBOSE=2 in the environment ?

Thanks for the reply. With either autodetection or DYNAMIC_ARCH turned on on WSL I got "cooperlake" instead of "zen" which is weird. On WSL at least it built, while with Anaconda Prompt it failed like I wrote in the other ticket.

It looks like when I build my code that calls cblas_sbgemm against the built OpenBLAS library, no AVX-512 instructions end up in the object file.

martin-frbg commented 5 months ago

Cooperlake would currently be correct for Zen4 (to make use of the AVX512BF16 instructions) but no AVX512 seen in the build is suspicious. I would think LLVM supports it, only a plain VS build would use slower C codes for everything

martin-frbg commented 5 months ago

As far as I can tell, the AVX512 code paths should be taken (unless there is an error in your input data that gets caught in the interface/gemm.c code before calling the actual BLAS kernel for SBGEMM). Unfortunately I won't have access to Zen4 hardware until the weekend. Does the "test_sbgemm" executable in the test folder (comparing SGEMM and SBGEMM results) work for you without raising an error?

martin-frbg commented 5 months ago

BTW you could build for TARGET=ZEN (or set OPENBLAS_CORETYPE=ZEN for a DYNAMIC_ARCH build at runtime) to get non-AVX512 codes for comparison, but even if the AVX512_BF16 implementation on Zen4 was a lot less performant than on Intel Cooperlake I doubt that the penalty would amount to 30x. Hard to guess what else could be wrong though
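
For a DYNAMIC_ARCH build, the runtime comparison suggested above could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, and the benchmark command (for example `["./test_sbgemm"]`) is a hypothetical path to whatever executable is being timed.

```python
import os
import subprocess


def env_for_coretype(coretype, base=None):
    """Copy an environment and pin the OpenBLAS kernel set (DYNAMIC_ARCH builds only)."""
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_CORETYPE"] = coretype  # e.g. "ZEN" or "COOPERLAKE"
    env["OPENBLAS_VERBOSE"] = "2"        # ask OpenBLAS to report the core type it uses
    return env


def run_with_coretype(command, coretype):
    """Run a benchmark command (hypothetical, e.g. ["./test_sbgemm"]) under one core type."""
    return subprocess.run(command, env=env_for_coretype(coretype),
                          capture_output=True, text=True, check=False)
```

Running the same benchmark once with `"ZEN"` and once with `"COOPERLAKE"` and comparing the timings would show how much the AVX512 kernels actually contribute.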

martin-frbg commented 5 months ago

So on Zen4 under plain Linux SBGEMM and SGEMM show basically equal performance according to my tests. When AVX512 is not available however, fallback to the generic C kernel for SBGEMM causes performance to suck a lot more than I remembered. This is probably what you are seeing in your WSL setup - either because AVX512 assembler kernels were not compiled in, or the hypervisor blocks/slows accesses to the AVX512 hardware

moderato commented 5 months ago

> As far as I can tell, the AVX512 code paths should be taken (unless there is an error in your input data that gets caught in the interface/gemm.c code before calling the actual BLAS kernel for SBGEMM). Unfortunately I won't have access to Ryzen4 hardware until the weekend. Does the "test_sbgemm" executable in the test folder (comparing SGEMM and SBGEMM results) work for you without raising an error ?

Yes, test_sbgemm works well functionally. I added a timing function there and the performance gap is still big. Here are the compilation command and the results in seconds:

cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DC_LAPACK -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=12 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.27\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -march=cooperlake -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -o test_sbgemm compare_sgemm_sbgemm.c ../libopenblas_cooperlakep-r0.3.27.a -lm -lpthread -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../..  -lc
fp32: 0.055850, bf16: 0.557987

Seems like with -DBUILD_BFLOAT16 AVX512 is still not enabled as I didn't see any -m flag that enables it. Any thoughts?
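
To put a number on the gap, the timing line above can be turned into a ratio. The sketch below assumes the `fp32: ..., bf16: ...` output format of the modified test shown in this comment; the function names are illustrative.

```python
import re


def parse_timings(line):
    """Parse a line like 'fp32: 0.055850, bf16: 0.557987' into a dict of seconds."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+):\s*([0-9.]+)", line)}


def slowdown(line):
    """Return how many times slower bf16 is than fp32 for one timing line."""
    timings = parse_timings(line)
    return timings["bf16"] / timings["fp32"]


print(round(slowdown("fp32: 0.055850, bf16: 0.557987"), 2))  # prints 9.99
```

For this particular run the gap is therefore about 10x, rather than the 30x mentioned at the top of the thread.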

martin-frbg commented 5 months ago

With LLVM 18 (as per your other issue) there should be a -march=cooperlake that normally implies AVX512. (My test under Linux was done with GCC however, and the timings for both SGEMM and SBGEMM in the modified test_sbgemm were in the 0.0002s range)

moderato commented 5 months ago

So I tried adding a bunch of AVX512 related flags (-mavx512f -mavx512bf16 -mavx512pf -mavx512er -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg) together with -march=cooperlake to both the build commands of OpenBLAS and my own code. Running objdump -xD --demangle test_executable | grep zmm prints instructions containing zmm now, meaning building with bf16 seems to be working.

However the performance gap is still more or less the same... Really don't understand where the problem is.

moderato commented 5 months ago

> With LLVM 18 (as per your other issue) there should be a -march=cooperlake that normally implies AVX512. (My test under Linux was done with GCC however, and the timings for both SGEMM and SBGEMM in the modified test_sbgemm were in the 0.0002s range)

May I ask which AVX512 instructions correspond to sbgemm in your executable? I saw that there are three BF16-related instructions in AVX512. I only have vcvtneps2bf16 on my side, but I believe VDPBF16PS is the one that does the compute. Does your executable have that instruction?
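
The three AVX512_BF16 instructions are VCVTNE2PS2BF16 and VCVTNEPS2BF16 (conversions) and VDPBF16PS (dot product). A rough way to see which of them ended up in a binary is to count their mnemonics in the disassembly. The sketch below works on text such as the output of `objdump -d`; the prefix matching is an assumption meant to also catch suffixed spellings, and it would also count symbol names that happen to contain a mnemonic.

```python
import re

# The three AVX512_BF16 instructions: two conversions and one dot product.
BF16_MNEMONICS = ("vcvtne2ps2bf16", "vcvtneps2bf16", "vdpbf16ps")


def count_bf16_instructions(disassembly):
    """Count AVX512_BF16 mnemonics in objdump-style disassembly text."""
    counts = {name: 0 for name in BF16_MNEMONICS}
    for token in re.findall(r"[a-z0-9]+", disassembly.lower()):
        for name in BF16_MNEMONICS:
            # Prefix match, so a suffixed spelling of a mnemonic is counted too.
            if token.startswith(name):
                counts[name] += 1
    return counts
```

A count of zero for `vdpbf16ps` in the OpenBLAS library itself would be consistent with the optimized SBGEMM kernel not having been built.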

martin-frbg commented 5 months ago

I think it simply boils down to whether your build uses the sbgemm_kernel_16x4_cooperlake.c from kernel/x86_64 (which uses intrinsics from immintrin.h) or not.

moderato commented 5 months ago

> I think it simply boils down to whether your build uses the sbgemm_kernel_16x4_cooperlake.c from kernel/x86_64 (which uses intrinsics from immintrin.h) or not.

I see. How do I check that when I build OpenBLAS? Or, how do I enforce that in the following build command to quickly test it with test_sbgemm?

cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DC_LAPACK -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=12 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.27\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -march=cooperlake -mavx512f -mavx512bf16 -mavx512pf -mavx512er -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -o test_sbgemm compare_sgemm_sbgemm.c ../libopenblas_cooperlakep-r0.3.27.a -lm -lpthread -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../..  -lc

martin-frbg commented 5 months ago

Check whether you see sbgemm_kernel_16x4_cooperlake.c in the build log for OpenBLAS (you may need to set CMAKE_VERBOSE_MAKEFILE to get complete build output). Or look in the kernel.vcxproj underneath your build folder; I think it should contain the full name of the source file used for what becomes the sbgemm_kernel object.

moderato commented 5 months ago

OK thanks. I rebuilt it and actually didn't see sbgemm_kernel_16x4_cooperlake.c coming up in the log. Is there any way to enforce it?

martin-frbg commented 5 months ago

Can you upload the log please, or at least the config.h that was generated in your build folder ? Having COOPERLAKE as the build target should be enough to enforce it, unless your compiler does not support AVX512 or lacks the immintrin.h header file.

moderato commented 5 months ago

Here are the config.h and the build log. It looks like sbgemm_kernel is built neither with AVX512 nor from sbgemm_kernel_16x4_cooperlake.c.

#define OS_LINUX    1
#define ARCH_X86_64 1
#define C_GCC   1
#define __64BIT__   1
#define FUNDERSCORE 
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define COOPERLAKE
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 1048576
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 0
#define DTB_DEFAULT_ENTRIES 72
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_AVX
#define HAVE_AVX2
#define HAVE_AVX512VL
#define HAVE_AVX512BF16
#define HAVE_FMA3
#define HAVE_CFLUSH
#define HAVE_MISALIGNSSE
#define HAVE_FASTMOVU
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_COOPERLAKE
#define CHAR_CORENAME "COOPERLAKE"
#define SLOCAL_BUFFER_SIZE  20480
#define DLOCAL_BUFFER_SIZE  12288
#define CLOCAL_BUFFER_SIZE  12288
#define ZLOCAL_BUFFER_SIZE  8192
#define GEMM_MULTITHREAD_THRESHOLD  4

build.txt

moderato commented 5 months ago

And here's the truncated log of the cmake configuration. The original was too big to upload, so I kept only the sbgemm-related part. Please let me know if you need anything from the rest. Thanks!

trace_truncated.txt

martin-frbg commented 5 months ago

At least the config has COOPERLAKE and HAVE_AVX512BF16 as it should. Can you check/show what's in /mnt/c/Users/moderato/Documents/repos/OpenBLAS/build/linux/kernel/CMakeFiles/sbgemm_kernel.c please? This should have either the optimized cooperlake kernel or the generic 2x2 one on its last line...
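
The check described above can be sketched as a small script. It only assumes that the last non-blank line of the generated sbgemm_kernel.c mentions the kernel source file by name; the exact form of that line is not shown in this thread, and the function names are illustrative.

```python
OPTIMIZED = "sbgemm_kernel_16x4_cooperlake.c"


def last_nonblank_line(text):
    """Return the last non-empty line of a file's contents ('' if there is none)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def uses_optimized_kernel(text):
    """True if the generated wrapper ends by naming the Cooperlake SBGEMM kernel."""
    return OPTIMIZED in last_nonblank_line(text)
```

Passing the contents of the generated file, for example `uses_optimized_kernel(open(path).read())`, returns `False` when the build fell back to some other kernel.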

moderato commented 5 months ago

> At least the config has COOPERLAKE and HASAVX512BF16 as it should. Can you check/show what's in /mnt/c/Users/moderato/Documents/repos/OpenBLAS/build/linux/kernel/CMakeFiles/sbgemm_kernel.c please ? This should either have the optimized cooperlake kernel or the generic 2x2 one on its last line...

Sorry for the late reply, I was sick for the past few days. It shows the generic 2x2 one...

martin-frbg commented 5 months ago

Hmm. Gmake build (build.txt) inexplicably terminated with an undefined macro although it is obviously present at the end of the config.h you posted - the latter file is probably from the cmake build attempt ? But I cannot tell much from truncated.txt - which version of gcc are you using in these builds ? No indication so far for why it went for the fallback kernel after apparently recognizing Cooperlake target and AVX512BF16 capability.

moderato commented 5 months ago

> Hmm. Gmake build (build.txt) inexplicably terminated with an undefined macro although it is obviously present at the end of the config.h you posted - the latter file is probably from the cmake build attempt ? But I cannot tell much from truncated.txt - which version of gcc are you using in these builds ? No indication so far for why it went for the fallback kernel after apparently recognizing Cooperlake target and AVX512BF16 capability.

Ah sorry, I just realized that build.txt was the log of a failed build. Here's the updated one. Also, my gcc version is 11.4.0.

build.txt

martin-frbg commented 5 months ago

Unfortunately that does not tell me anything new, as all choices have already been made at this point. Can you redirect the output of the initial cmake run to a file please? gcc 11.4 should be recent enough to support AVX512BF16 (and in particular the _mm512_dpbf16_ps intrinsic that is used in a code snippet in c_check to test whether the compiler supports it). And I do not know of any limitations regarding AVX512BF16 in WSL - basically I think this should behave like a Linux build.

moderato commented 5 months ago

> Unfortunately that does not tell me anything new, as all choices have already been made at this point. Can you redirect the output of the initial cmake run to a file please ? gcc 11.4 should be recent enough to support AVX512BF16 (and in particular the _mm512_dpbf16_ps instruction that is used in a code snippet in c_check to test if the compiler supports it). And I do not know of any limitations regarding AVX512BF16 in WSL - basically I think this should behave like a Linux build

build.txt is the output of the command `cmake --build .` (I replaced make with it), and the previous trace_truncated.txt is the output of the command `cmake ..`. Please let me know if these are not what you need.

In the related issue #4672, which I closed, I mentioned that I also observed the BF16 build failing on Windows, so to me this looks like a generic, platform-independent problem. The weird thing is that many cc commands run with the flags -m64 -march=cooperlake -mavx2 -mavx -msse -msse2 -msse3 -mssse3 -msse4.1. I would expect a correct build to include the AVX512-related flags here. How (and where) are these flags specified? And is there a way to enforce them?

martin-frbg commented 5 months ago

The -march=cooperlake covers all the AVX512 related flags already. This gets specified in cmake/cc.cmake when autodetection (or TARGET specification) produced COOPERLAKE. Can you please provide a non-truncated output of cmake .. without the "trace" setting ?
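
One way to check this on a given compiler is to ask it which feature macros it predefines for `-march=cooperlake`. The sketch below assumes a GCC-compatible driver that accepts `-dM -E`; the helper names are illustrative, and the exact set of macros printed depends on the compiler version.

```python
import re
import subprocess


def avx512_macros(macro_dump):
    """Return the sorted AVX512 feature macros found in `-dM -E` preprocessor output."""
    return sorted(set(re.findall(r"^#define (__AVX512\w*__) ", macro_dump, re.M)))


def dump_macros(compiler="gcc", march="cooperlake"):
    """Ask the compiler which macros it predefines for a given -march value."""
    result = subprocess.run(
        [compiler, f"-march={march}", "-dM", "-E", "-x", "c", "/dev/null"],
        capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `avx512_macros(dump_macros())` on the build machine lists what the compiler enables on its own. If `__AVX512BF16__` is in that list, the extra `-mavx512*` flags should not be needed.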

moderato commented 5 months ago

> The -march=cooperlake covers all the AVX512 related flags already. This gets specified in cmake/cc.cmake when autodetection (or TARGET specification) produced COOPERLAKE. Can you please provide a non-truncated output of cmake .. without the "trace" setting ?

Sure: config.txt

martin-frbg commented 5 months ago

Hmm, looks perfectly normal. There appears to be something going wrong with the handling of the ifneq...endif conditional in kernel/x86_64/KERNEL.COOPERLAKE - I suspect it will build with the correct SBGEMM kernel if you remove the two lines.

moderato commented 5 months ago

Thank you Martin! This PR perfectly fixes the problem and now I can see a ~2x speedup of BF16 GEMM against FP32.

Really appreciate it.

martin-frbg commented 5 months ago

Sorry it took me a while to understand the source of the problem.

moderato commented 5 months ago

Definitely no need to be sorry Martin. This seems to be a deep issue and you've been so helpful and providing so much useful guidance all the way. Thank you!

BoruiXu commented 4 months ago

Hi, I encountered the same problem. I compiled OpenBLAS 0.3.27 using gcc 13 and ran cblas_sbgemm on the same CPU (AMD 7640HS) under Ubuntu 22.04. Compared with fp32, the bfloat16 matrix multiplication is about 4x slower.

I also tried removing the relevant two lines in "KERNEL.COOPERLAKE", but it did not help...

martin-frbg commented 4 months ago

Please do not remove the lines in KERNEL.COOPERLAKE, you need to fix the cmake script that reads them. (That is, apply PR #4695 or simply copy the contents of the file cmake/utils.cmake from the github view of the develop branch)

BoruiXu commented 4 months ago

Thanks Martin. Actually, I have already tried the newest develop branch to compile OpenBLAS without any modification. However, I got the same result: cblas_sgemm is still 4x faster than cblas_sbgemm.

I compiled it with the same parameters as above:

cmake -DBUILD_BFLOAT16=ON -DBUILD_WITHOUT_LAPACK=yes -DNOFORTRAN=1 .. 
make -j
make install

martin-frbg commented 4 months ago

I do not think anything has changed since then that could have broken this again. Can you please check that your build actually uses sbgemm_kernel_16x4_cooperlake.c as the sbgemm_kernel source?

BoruiXu commented 4 months ago

Yes, it does. The file /kernel/CMakeFiles/sbgemm_kernel.c shows sbgemm_kernel_16x4_cooperlake.c on its last line.

Maybe something is wrong with my system? I once installed version 0.3.20 using apt, but have already uninstalled it. When compiling, I only include the new version. I do not know whether this caused the poor performance. It is weird...
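
If the program links OpenBLAS as a shared library, one way to rule out a stale copy is to load the library the same way the dynamic linker would and ask it to describe itself. The sketch below uses `openblas_get_config` and `openblas_get_corename`, both exported by OpenBLAS. The default library name is an assumption and may differ per system, and the check does not apply to a statically linked `.a`.

```python
import ctypes


def describe_openblas(library="libopenblas.so"):
    """Load an OpenBLAS shared library and return (config, corename), or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return None
    # Both functions return a C string; declare that so ctypes does not assume int.
    lib.openblas_get_config.restype = ctypes.c_char_p
    lib.openblas_get_corename.restype = ctypes.c_char_p
    return (lib.openblas_get_config().decode(),
            lib.openblas_get_corename().decode())
```

The config string typically includes the library version, so an unexpected 0.3.20 there would point at a leftover installation.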

Thanks Martin.

OpenMathLib / OpenBLAS

`cblas_sbgemm` extremely slow on WSL & AMD CPU #4673
