flame / blis

BLAS-like Library Instantiation Software Framework

When could you support AMD Zen4 arch? #770

Open ltjsjyyy opened 1 year ago

devinamatthews commented 1 year ago

Zen4 is already supported in AMD's fork of BLIS. We're in contact with AMD to coordinate how best to back-port these changes to BLIS master.

AngryLoki commented 10 months ago

Hi. I've run some experiments using the scripts from https://github.com/flame/blis/blob/master/docs/Performance.md and AMD's fork of BLIS. I tested only GEMM, and only in multithreaded mode, because the output of https://github.com/amd/blis/tree/master/test/3 is not compatible with that of https://github.com/flame/blis/tree/master/test/3 , but this test was enough for my initial needs.

My setup:

Commands executed:

BLIS_NUM_THREADS=32     ./test_sgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d s -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d s -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d s -c nn   -i native -p "256 5120 128" -r 3 -v

BLIS_NUM_THREADS=32     ./test_dgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d d -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d d -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d d -c nn   -i native -p "256 5120 128" -r 3 -v
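The three flame/blis driver invocations above differ only in the thread-count variable, the driver name, and the datatype flag, so they can be generated by a loop. This is a minimal sketch, assuming the drivers have been built in the current directory; `DRY_RUN` and `run_gemm_benchmarks` are hypothetical helper names, not part of BLIS, and the amd-blis binaries are left out because they take no arguments in the commands above.

```shell
#!/bin/sh
# Sketch: replay the flame/blis test/3 GEMM driver commands listed above
# for both datatypes. DRY_RUN and run_gemm_benchmarks are illustrative
# names (assumptions), not part of BLIS.

THREADS=32
PSIZES="256 5120 128"   # problem-size argument copied verbatim from the commands above

run_gemm_benchmarks() {
    for dt in s d; do
        for entry in "BLIS_NUM_THREADS:blis" "MKL_NUM_THREADS:vendor" "OPENBLAS_NUM_THREADS:openblas"; do
            var=${entry%%:*}    # environment variable that sets the thread count
            impl=${entry#*:}    # driver suffix: blis, vendor, or openblas
            if [ "${DRY_RUN:-1}" = "1" ]; then
                # Default: print the command instead of running it.
                echo "$var=$THREADS ./test_gemm_${impl}_mt.x -d $dt -c nn -i native -p \"$PSIZES\" -r 3 -v"
            else
                env "$var=$THREADS" "./test_gemm_${impl}_mt.x" -d "$dt" -c nn -i native -p "$PSIZES" -r 3 -v
            fi
        done
    done
}

run_gemm_benchmarks
```

By default the script only prints the six commands; setting `DRY_RUN=0` would execute them.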

Results:

[image: GEMM benchmark results]

Comments: With its Zen4 kernels, the AMD fork of BLIS significantly outperforms all other libraries on the AMD Ryzen 9 7950X3D (by up to 2x). Vanilla BLIS is on par with OpenBLAS, but slower than MKL. MKL shows a performance drop for some sizes, but it looks like a fluke (it disappears for larger sizes). When checking GEMM for larger matrices (such as 6000x6000), performance was the same for all 4 libraries (presumably due to a memory bottleneck on my system).
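For readers comparing such numbers: a GEMM with dimensions m, n, k performs roughly 2·m·n·k floating-point operations, so a GFLOPS figure can be derived from a measured run time. The sketch below illustrates that arithmetic; `gemm_gflops` is a hypothetical helper and the sample timing is made up for illustration, not a measurement from this thread.

```shell
#!/bin/sh
# Sketch: convert a GEMM run time into GFLOPS using the usual 2*m*n*k
# operation count. gemm_gflops is an illustrative name (assumption).

gemm_gflops() {
    # Arguments: m n k seconds
    awk -v m="$1" -v n="$2" -v k="$3" -v t="$4" \
        'BEGIN { printf "%.1f\n", 2 * m * n * k / t / 1e9 }'
}

# Hypothetical example: a 1000x1000x1000 GEMM that took 0.5 s.
gemm_gflops 1000 1000 1000 0.5   # prints 4.0
```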

fgvanzee commented 10 months ago

@AngryLoki Thank you for taking the time to gather, visualize, and share these performance results! Don't worry; a proper zen4 subconfiguration will be added to vanilla BLIS in the future. We are just overwhelmed with work these days relative to our resources. Thanks for your patience in the meantime. ❤️

PS: Please feel free to keep up with us in our Discord server, if you haven't already joined! 😄

HaukurPall commented 8 months ago

@AngryLoki thank you for this information.

I am curious, did you also test AMD/blis compiled with AOCC? I've been experimenting with it on my system (Gentoo AMD 7840U) and it's performing well on certain tasks.

AngryLoki commented 8 months ago

@HaukurPall, I checked sgemm (M=N=K) with gcc 13.2.1 (+full LTO), clang 17.0.6, AOCC and rocm-llvm-alt. The results are almost the same.

[image: compilers]

I checked the code of AOCC and unfortunately I don't see any specific optimizations... AMD just shipped vanilla precompiled Clang and included some ROCm-related fixes (to make it work, not for optimization). They also added https://github.com/ROCm/llvm-project/commit/0272becdab2be383036a3d9409041996c5fa5fa6 - if you attempt to use -famd-opt, it tries to use the proprietary version of Clang, rocm-llvm-alt, which actually has some interesting optimizations. However, even after installing rocm-llvm-alt I was not able to increase performance for AOCL-BLAS. Anyway, ICX, AOCC and rocm-llvm-alt are basically Clang. With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back into an optimizable LLVM representation.

Regarding my previous tests, I checked my approach more carefully and found a few mistakes on my side:

devinamatthews commented 8 months ago

BLIS is usually pretty insensitive to the compiler, since most of the work happens in the inline assembly kernels.

> With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.

I consider this a good thing, since LLVM (and, to be fair, other compilers too) really makes a hash of C or intrinsics kernels due to a combination of poor register allocation and instruction ordering.

Glad to see that AOCL-BLIS is performing well for you though. As we work with AMD to backport their changes, BLIS will catch up.

HaukurPall commented 7 months ago

@AngryLoki thank you so much for this, this answers a lot of questions!