Open ltjsjyyy opened 1 year ago
Hi. I've conducted some experiments using scripts from https://github.com/flame/blis/blob/master/docs/Performance.md and AMD's fork of BLIS. I tested only GEMM and only in multithread mode, as https://github.com/amd/blis/tree/master/test/3 output is not compatible with https://github.com/flame/blis/tree/master/test/3 , but this test was enough for initial needs.
My setup:
CC="clang" CXX="clang++" AR="llvm-ar" RANLIB="llvm-ranlib" ./configure -t openmp zen4
)zen3 kernels. All libraries were, in general, compiled with zen4-native flags.
Commands executed:
BLIS_NUM_THREADS=32 ./test_sgemm_5120_asm_blis_st.x # amd-blis
BLIS_NUM_THREADS=32 ./test_gemm_blis_mt.x -d s -c nn -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32 ./test_gemm_vendor_mt.x -d s -c nn -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d s -c nn -i native -p "256 5120 128" -r 3 -v
BLIS_NUM_THREADS=32 ./test_dgemm_5120_asm_blis_st.x # amd-blis
BLIS_NUM_THREADS=32 ./test_gemm_blis_mt.x -d d -c nn -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32 ./test_gemm_vendor_mt.x -d d -c nn -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d d -c nn -i native -p "256 5120 128" -r 3 -v
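The six runs above that use the vanilla BLIS test/3 drivers follow one pattern: a per-library thread-count variable, a per-library binary, and a datatype flag. As a minimal sketch (the helper name `build_commands` and its defaults are hypothetical, chosen only to mirror the commands listed above; the two amd-blis binaries use a different harness and are left out), the command lines could be generated like this:

```python
import shlex

# Thread-count variable and driver binary per library, copied from the commands above.
LIBRARIES = [
    ("BLIS_NUM_THREADS", "./test_gemm_blis_mt.x"),
    ("MKL_NUM_THREADS", "./test_gemm_vendor_mt.x"),
    ("OPENBLAS_NUM_THREADS", "./test_gemm_openblas_mt.x"),
]


def build_commands(threads=32, sizes="256 5120 128", repeats=3):
    """Return shell command lines for the sgemm and dgemm runs (hypothetical helper)."""
    commands = []
    for dtype in ("s", "d"):  # single and double precision
        for env_var, binary in LIBRARIES:
            args = [binary, "-d", dtype, "-c", "nn", "-i", "native",
                    "-p", sizes, "-r", str(repeats), "-v"]
            # shlex.quote keeps the size triple together as one shell argument.
            quoted = " ".join(shlex.quote(a) for a in args)
            commands.append(f"{env_var}={threads} {quoted}")
    return commands


if __name__ == "__main__":
    for line in build_commands():
        print(line)
```

The printed lines can be pasted into a shell from the directory that holds the built drivers; the size argument comes out single-quoted rather than double-quoted, which the shell treats the same way here.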
Results:
Comments: AMD's fork of BLIS significantly outperforms all other libraries on AMD Ryzen 9 7950X3D with Zen4 kernels (up to 2x). Vanilla BLIS is on par with OpenBLAS, but slower than MKL. There is a performance drop in the MKL library for some sizes, but it looks like a fluke (it disappears for larger sizes). When checking gemm for larger matrices (like 6000*6000), performance was the same for all 4 libraries (presumably due to a memory bottleneck on my system).
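For anyone reproducing the comparison, here is a rough sketch of how a GEMM timing maps to GFLOPS and how large the operands get at the 6000*6000 size mentioned above. It assumes the conventional 2*m*n*k flop count for real-valued GEMM; the function names and the timing in the usage lines are hypothetical, not measurements from this thread.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for C := A*B + C with A (m x k), B (k x n), assuming 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


def gemm_footprint_bytes(m, n, k, elem_size):
    """Bytes occupied by A, B and C together (elem_size: 4 for float, 8 for double)."""
    return (m * k + k * n + m * n) * elem_size


if __name__ == "__main__":
    # Hypothetical timing, for illustration only.
    print(gemm_gflops(5120, 5120, 5120, 0.25))
    # Combined size of the three single-precision 6000 x 6000 operands, in MB.
    print(gemm_footprint_bytes(6000, 6000, 6000, 4) / 1e6)
```

At 6000*6000 in single precision the three matrices together take 432 MB, so the working set is far larger than typical CPU caches, which fits the memory-bottleneck guess but does not prove it.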
@AngryLoki Thank you for taking the time to gather, visualize, and share these performance results! Don't worry; a proper zen4 subconfiguration will be added to vanilla BLIS in the future. We are just overwhelmed with work these days relative to our resources. Thanks for your patience in the meantime. ❤️
PS: Please feel free to keep up with us in our Discord server, if you haven't already joined! 😄
@AngryLoki thank you for this information.
I am curious, did you also test AMD/blis compiled with AOCC? I've been experimenting with it on my system (Gentoo AMD 7840U) and it's performing well on certain tasks.
@HaukurPall, I checked sgemm (M=N=K) with gcc 13.2.1 (+full lto), clang 17.0.6, AOCC and rocm-llvm-alt. The results are the same, or almost the same.
I checked the code of AOCC and unfortunately I don't see any specific optimizations... AMD just shipped vanilla precompiled Clang and included some ROCm-related fixes (to make it work, not for optimization). They also added https://github.com/ROCm/llvm-project/commit/0272becdab2be383036a3d9409041996c5fa5fa6 - if you attempt to use -famd-opt, it tries to use the proprietary version of Clang - rocm-llvm-alt - which actually has some interesting optimizations. However, even after installing rocm-llvm-alt I was not able to increase performance for AOCL-BLAS. Anyways, ICX, AOCC and rocm-llvm-alt are basically Clang. With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.
Regarding my previous tests, I checked my approach more carefully and found a few mistakes on my side:
if cpu = zen: use slow code, we shipped extra megabytes specifically to degrade AMD performance. Followed https://documentation.sigma2.no/jobs/mkl.html#forcing-mkl-to-use-best-performing-routines and it made MKL 2 times faster.
OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-15
BLIS is usually pretty insensitive to compiler since most of the work happens in the inline assembly kernels.
With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.
I consider this a good thing, since LLVM (and, to be fair, other compilers too) really makes a hash of C or intrinsics kernels due to a combination of poor register allocation and instruction ordering.
Glad to see that AOCL-BLIS is performing well for you though. As we work with AMD to backport their changes, BLIS will catch up.
@AngryLoki thank you so much for this, this answers a lot of questions!
Zen4 is already supported in AMD's fork of BLIS. We're in contact with AMD to coordinate how best to back-port these changes to BLIS master.