ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Question] The source code of highly optimized gemm kernel #1512

Open mmyxym opened 2 weeks ago

mmyxym commented 2 weeks ago

I'm looking for highly optimized GEMM kernel source code, such as a typical F16 GEMM implementation with M,N,K=1024, but couldn't find it in the rocBLAS/hipBLASLt/rocWMMA/Tensile repos. Any information is appreciated, thanks!

babakpst commented 2 weeks ago

Thanks for your question. Almost all GEMM kernels, including F16, use assembly kernels stored in the repo per architecture under library/src/blas3/Tensile/Logic/asm_full. If you want to see the solution/kernel in the library logic for a particular GEMM or size, look up the size in the logic file and read off the corresponding solutionIndex; entries have the form `[m,n,batch,k]: [solutionIndex, efficiency]`. Alternatively (and if the size does not exist in the library), you can run your GEMM with the flag TENSILE_DB=0x20000 to print out the solutionIndex. The actual assembly kernels are in build_tmp in the build directory.
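The size-to-solution lookup described above can be sketched like this. All entry values here are invented for illustration; real library logic files are YAML and considerably more complex:

```python
# Hypothetical sketch of how a Tensile library-logic file maps an exact
# GEMM problem size to a solution index. The dictionary entries are made
# up for illustration; real logic files live under
# library/src/blas3/Tensile/Logic/asm_full and are per-architecture YAML.

# Each entry: (M, N, batch, K) -> (solutionIndex, efficiency)
exact_logic = {
    (1024, 1024, 1, 1024): (39, 0.92),   # invented values
    (2048, 2048, 1, 2048): (57, 0.95),   # invented values
}

def lookup_solution(m, n, batch, k):
    """Return (solutionIndex, efficiency) for an exact size, or None."""
    return exact_logic.get((m, n, batch, k))

print(lookup_solution(1024, 1024, 1, 1024))  # -> (39, 0.92)
print(lookup_solution(3, 3, 1, 3))           # -> None (size not in logic)
```

When an exact size is missing, Tensile falls back to other selection rules, which is why running with TENSILE_DB=0x20000 and reading the printed winning solutionIndex is the reliable way to see what was actually picked.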

mmyxym commented 1 week ago

@babakpst, thanks very much for your reply! I used TENSILE_DB=0x20000 to run torch.matmul and it printed "Library logic solution index of winning solution: 39". I guess the Docker image only contains binaries, not the assembly kernels. I then built rocBLAS from source and can see the assembly kernels in build/release/library/src/build_tmp/TENSILE/assembly/. Could you please help me understand how to associate solution index 39 with the exact assembly kernel? Thanks!

LeiWang1999 commented 1 week ago

By the way, I'm looking for the permute policy that rocBLAS applies to avoid bank conflicts with FP16 MFMA on MI250. Does anybody have experience with that? :)