Closed moderato closed 5 months ago
It's not quite clear to me how you got around the internal compiler error mentioned in your other ticket. Did you get a library with "zen" in the name (meaning autodetection of the CPU type worked)? Or, if you did a DYNAMIC_ARCH build, what CPU type gets reported when you set OPENBLAS_VERBOSE=2 in the environment?
Thanks for the reply. With either autodetection or DYNAMIC_ARCH turned on under WSL, I got "cooperlake" instead of "zen", which is weird. On WSL it at least built, while in Anaconda Prompt it failed as I described in the other ticket.
It looks like when I use the built OpenBLAS library to build my code with cblas_sbgemm, no AVX-512 related instructions are included in the object file.
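For a DYNAMIC_ARCH build, the core type asked about above can be read from the library's startup message. The following is a minimal sketch, assuming the library prints a line of the form "Core: <name>" to stderr when OPENBLAS_VERBOSE=2 is set; the helper name reported_core is hypothetical.

```shell
# Run a program with OPENBLAS_VERBOSE=2 and print the core name the
# library reports. Assumption: the message looks like "Core: <name>"
# and goes to stderr; the program's own stdout is discarded.
reported_core() {
    OPENBLAS_VERBOSE=2 "$@" 2>&1 >/dev/null | sed -n 's/^Core: //p' | head -n 1
}

# Usage (with a program linked against a DYNAMIC_ARCH build):
#   reported_core ./test_sbgemm
```

If nothing is printed, the build is probably not a DYNAMIC_ARCH one, and the library file name (for example the "cooperlake" part of libopenblas_cooperlakep-r0.3.27.a) tells you the target instead.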
Cooperlake would currently be correct for Zen4 (to make use of the AVX512BF16 instructions), but seeing no AVX512 in the build is suspicious. I would think LLVM supports it; only a plain VS build would use the slower C code for everything.
As far as I can tell, the AVX512 code paths should be taken (unless there is an error in your input data that gets caught in the interface/gemm.c code before calling the actual BLAS kernel for SBGEMM). Unfortunately I won't have access to Zen4 hardware until the weekend. Does the "test_sbgemm" executable in the test folder (comparing SGEMM and SBGEMM results) work for you without raising an error?
BTW you could build for TARGET=ZEN (or set OPENBLAS_CORETYPE=ZEN at runtime for a DYNAMIC_ARCH build) to get non-AVX512 code for comparison, but even if the AVX512_BF16 implementation on Zen4 was a lot less performant than on Intel Cooperlake, I doubt that the penalty would amount to 30x. Hard to guess what else could be wrong though.
So on Zen4 under plain Linux SBGEMM and SGEMM show basically equal performance according to my tests. When AVX512 is not available however, fallback to the generic C kernel for SBGEMM causes performance to suck a lot more than I remembered. This is probably what you are seeing in your WSL setup - either because AVX512 assembler kernels were not compiled in, or the hypervisor blocks/slows accesses to the AVX512 hardware
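Whether the (virtualized) CPU exposes AVX512 BF16 at all can be checked from inside WSL before looking at the build. This is a sketch assuming a Linux-style /proc/cpuinfo in which the feature shows up as the flag avx512_bf16; the helper name cpu_has_flag is hypothetical.

```shell
# Print "yes" if the given CPU feature flag is listed in cpuinfo, else "no".
# $1: flag name, e.g. avx512_bf16    $2: cpuinfo file (default /proc/cpuinfo)
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Take the first "flags" line, split it into one word per line,
    # and look for an exact match.
    if grep -m1 '^flags' "$file" | tr ' \t' '\n\n' | grep -qx "$flag"; then
        echo yes
    else
        echo no
    fi
}

# Usage:
#   cpu_has_flag avx512f
#   cpu_has_flag avx512_bf16
```

A "no" here would point at the hypervisor hiding the feature rather than at the OpenBLAS build.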
Yes, test_sbgemm works well functionally. I added a timing function there and the performance gap is still big. Here are the compilation command and the results in seconds:
cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DC_LAPACK -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=12 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.27\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -march=cooperlake -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -o test_sbgemm compare_sgemm_sbgemm.c ../libopenblas_cooperlakep-r0.3.27.a -lm -lpthread -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. -lc
fp32: 0.055850, bf16: 0.557987
Seems like with -DBUILD_BFLOAT16 AVX512 is still not enabled, as I didn't see any -m flag that enables it. Any thoughts?
With LLVM 18 (as per your other issue) there should be a -march=cooperlake that normally implies AVX512. (My test under Linux was done with GCC however, and the timings for both SGEMM and SBGEMM in the modified test_sbgemm were in the 0.0002s range.)
So I tried adding a bunch of AVX512-related flags (-mavx512f -mavx512bf16 -mavx512pf -mavx512er -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg) together with -march=cooperlake to the build commands of both OpenBLAS and my own code. Running objdump -xD --demangle test_executable | grep zmm now prints instructions containing zmm, meaning building with bf16 seems to be working.
However the performance gap is still more or less the same... Really don't understand where the problem is.
May I ask what the AVX512 instructions corresponding to sbgemm in your executable are? I saw there are three BF16-related instructions in AVX512. I only have vcvtneps2bf16 on my side, but I believe VDPBF16PS is the one that does the compute. Does your executable have that instruction?
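The question of which BF16 instructions ended up in a binary can be answered by counting mnemonics in a disassembly. A small sketch, assuming an objdump recent enough to decode AVX512_BF16; the helper name count_insn is hypothetical.

```shell
# Count the disassembly lines on stdin that contain the given mnemonic
# (whole word, case-insensitive).
count_insn() {
    grep -ciw "$1"
}

# Usage:
#   objdump -d test_sbgemm | count_insn vdpbf16ps
#   objdump -d test_sbgemm | count_insn vcvtneps2bf16
```

A count of zero for vdpbf16ps alongside a non-zero count for vcvtneps2bf16 would be consistent with conversions being compiled for AVX512 while the GEMM kernel itself is not the BF16 dot-product one.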
I think it simply boils down to whether your build uses the sbgemm_kernel_16x4_cooperlake.c from kernel/x86_64 (which uses intrinsics from immintrin.h) or not.
I see. How do I check that when I build OpenBLAS? Or, how do I enforce that in the following build command to quickly test it with test_sbgemm?
cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DC_LAPACK -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=12 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.27\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -march=cooperlake -mavx512f -mavx512bf16 -mavx512pf -mavx512er -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -o test_sbgemm compare_sgemm_sbgemm.c ../libopenblas_cooperlakep-r0.3.27.a -lm -lpthread -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. -lc
Check if you see sbgemm_kernel_16x4_cooperlake.c in the build log for OpenBLAS (you may need to set CMAKE_VERBOSE_MAKEFILE to get complete build output). Or look in the kernel.vcxproj underneath your build folder; I think it should contain the full name of the source file to use for what becomes the sbgemm_kernel object.
OK thanks. I rebuilt it and actually didn't see sbgemm_kernel_16x4_cooperlake.c coming up in the log. Is there any way to enforce it?
Can you upload the log please, or at least the config.h that was generated in your build folder ? Having COOPERLAKE as the build target should be enough to enforce it, unless your compiler does not support AVX512 or lacks the immintrin.h header file.
Here are the config.h and the build log. It looks like sbgemm_kernel is built with neither AVX512 nor sbgemm_kernel_16x4_cooperlake.c.
#define OS_LINUX 1
#define ARCH_X86_64 1
#define C_GCC 1
#define __64BIT__ 1
#define FUNDERSCORE
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define COOPERLAKE
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 1048576
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 0
#define DTB_DEFAULT_ENTRIES 72
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_AVX
#define HAVE_AVX2
#define HAVE_AVX512VL
#define HAVE_AVX512BF16
#define HAVE_FMA3
#define HAVE_CFLUSH
#define HAVE_MISALIGNSSE
#define HAVE_FASTMOVU
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_COOPERLAKE
#define CHAR_CORENAME "COOPERLAKE"
#define SLOCAL_BUFFER_SIZE 20480
#define DLOCAL_BUFFER_SIZE 12288
#define CLOCAL_BUFFER_SIZE 12288
#define ZLOCAL_BUFFER_SIZE 8192
#define GEMM_MULTITHREAD_THRESHOLD 4
And here's the truncated log for the cmake configuration. The original was too big to be uploaded, so I truncated it to the sbgemm-related part. Please let me know if you need anything from the rest. Thanks!
At least the config has COOPERLAKE and HAVE_AVX512BF16 as it should. Can you check/show what's in /mnt/c/Users/moderato/Documents/repos/OpenBLAS/build/linux/kernel/CMakeFiles/sbgemm_kernel.c please? This should have either the optimized cooperlake kernel or the generic 2x2 one on its last line...
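The check described above can be scripted. This sketch only assumes what the thread states, namely that the last line of the generated sbgemm_kernel.c names the kernel source; the helper name sbgemm_kernel_choice and the "optimized"/"fallback" labels are hypothetical.

```shell
# Print the last non-empty line of the generated wrapper file and label it
# by whether it mentions the cooperlake kernel.
sbgemm_kernel_choice() {
    last=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)
    case $last in
        *cooperlake*) echo "optimized: $last" ;;
        *)            echo "fallback: $last" ;;
    esac
}

# Usage, from the cmake build directory:
#   sbgemm_kernel_choice kernel/CMakeFiles/sbgemm_kernel.c
```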
Sorry for the late reply, I was sick for the past few days. It's the generic 2x2, as it shows...
Hmm. Gmake build (build.txt) inexplicably terminated with an undefined macro although it is obviously present at the end of the config.h you posted - the latter file is probably from the cmake build attempt ? But I cannot tell much from truncated.txt - which version of gcc are you using in these builds ? No indication so far for why it went for the fallback kernel after apparently recognizing Cooperlake target and AVX512BF16 capability.
Ah sorry I just realized build.txt is a failed log. Here's the updated one. Also my gcc version is 11.4.0.
Unfortunately that does not tell me anything new, as all choices have already been made at this point. Can you redirect the output of the initial cmake run to a file please?
gcc 11.4 should be recent enough to support AVX512BF16 (and in particular the _mm512_dpbf16_ps intrinsic that is used in a code snippet in c_check to test if the compiler supports it). And I do not know of any limitations regarding AVX512BF16 in WSL - basically I think this should behave like a Linux build.
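Compiler support for the intrinsic can also be probed by hand. The following is a sketch of such a probe, not the actual snippet from c_check; the helper name compiler_supports_bf16 and the test program are hypothetical, and the only flag assumed is -march=cooperlake as used in the thread.

```shell
# Print "yes" if the compiler can build a tiny program that uses
# _mm512_dpbf16_ps under -march=cooperlake, otherwise "no".
# $1: compiler command (default: cc)
compiler_supports_bf16() {
    cc_bin=${1:-cc}
    dir=$(mktemp -d) || return 1
    cat > "$dir/t.c" <<'EOF'
#include <immintrin.h>
int main(void) {
    __m512 acc = _mm512_setzero_ps();
    __m512bh a = _mm512_cvtne2ps_pbh(acc, acc);  /* two fp32 vectors -> bf16 */
    acc = _mm512_dpbf16_ps(acc, a, a);           /* bf16 dot product, fp32 accumulate */
    return (int)_mm512_reduce_add_ps(acc);
}
EOF
    # Compile only; the build machine does not need to run the result.
    if "$cc_bin" -march=cooperlake -c "$dir/t.c" -o "$dir/t.o" 2>/dev/null; then
        echo yes
    else
        echo no
    fi
    rm -rf "$dir"
}

# Usage:
#   compiler_supports_bf16 gcc
#   compiler_supports_bf16 clang
```

A "no" means either the compiler lacks AVX512BF16 support or the immintrin.h header is missing, which are the two failure causes named earlier in the thread.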
build.txt is the output of the command cmake --build . (I replaced make with it) and the previous trace_truncated.txt is the output of the command cmake .. . Please let me know if these are not what you need.
In the related issue #4672, which I closed, I mentioned that I also observed the build with BF16 failing on Windows, so to me this looks like a generic, platform-independent problem. The weird thing is that many cc commands run with these flags: -m64 -march=cooperlake -mavx2 -mavx -msse -msse2 -msse3 -mssse3 -msse4.1. I would suppose a correct build should include the AVX512-related flags here. How (and maybe where) are these flags specified? And is there a chance we can enforce something here?
The -march=cooperlake covers all the AVX512 related flags already. This gets specified in cmake/cc.cmake when autodetection (or TARGET specification) produced COOPERLAKE.
Can you please provide a non-truncated output of cmake .. without the "trace" setting?
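What a given -march setting implies can be seen by listing the compiler's predefined macros. A minimal sketch, assuming a GCC- or Clang-style driver that accepts -dM -E; the helper name avx512_macros is hypothetical.

```shell
# Filter a predefined-macro dump on stdin down to the distinct
# __AVX512...__ feature macros.
avx512_macros() {
    grep -o '__AVX512[A-Z0-9_]*__' | sort -u
}

# Usage:
#   cc -march=cooperlake -dM -E -x c /dev/null | avx512_macros
```

If __AVX512BF16__ appears in that list, no additional -mavx512... flags should be needed for the BF16 kernel to compile.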
Sure: config.txt
Hmm, looks perfectly normal. There appears to be something going wrong with the handling of the ifneq...endif conditional in kernel/x86_64/KERNEL.COOPERLAKE - I suspect it will build with the correct SBGEMM kernel if you remove the two lines.
Thank you Martin! This PR perfectly fixes the problem and now I can see a ~2x speedup of BF16 GEMM against FP32.
Really appreciate it.
Sorry it took me a while to understand the source of the problem.
Definitely no need to be sorry Martin. This seems to be a deep issue and you've been so helpful and providing so much useful guidance all the way. Thank you!
Hi, I have encountered the same problem. I compiled OpenBLAS 0.3.27 using gcc 13 and ran cblas_sbgemm on the same CPU, an AMD 7640HS; the system is Ubuntu 22.04. Compared with fp32, the bfloat16 matrix multiplication is about 4x slower.
I also tried removing the two relevant lines in "KERNEL.COOPERLAKE", but it did not work...
Please do not remove the lines in KERNEL.COOPERLAKE; you need to fix the cmake script that reads them. (That is, apply PR #4695 or simply copy the contents of the file cmake/utils.cmake from the github view of the develop branch.)
Thanks Martin. Actually, I have already tried compiling OpenBLAS from the newest develop branch without any modification. However, I got the same result: cblas_sgemm is still 4x faster than cblas_sbgemm.
And I just compiled it with the same parameters as above:
cmake -DBUILD_BFLOAT16=ON -DBUILD_WITHOUT_LAPACK=yes -DNOFORTRAN=1 ..
make -j
make install
I do not think anything changed since then that could have broken this again. Can you please check that your build actually uses sbgemm_kernel_16x4_cooperlake.c as the sbgemm_kernel source?
Yes, it does. The file /kernel/CMakeFiles/sbgemm_kernel.c shows sbgemm_kernel_16x4_cooperlake.c in the last line.
Maybe something is wrong with my system? I once installed version 0.3.20 using apt, but have already uninstalled it. When compiling, I only include the new version. I do not know if this caused the poor performance. It is weird...
Thanks Martin.
Hello, I'm trying to run cblas_sbgemm on WSL with an AMD CPU but find it extremely slow, e.g. 30x slower than cblas_sgemm. Does anyone know how to debug and solve this issue?
Related issue (run on Anaconda Prompt): https://github.com/OpenMathLib/OpenBLAS/issues/4672
System info:
Build command: