@ravil-mobile Some formatter made some changes to the amdgcn-cfg tool. Can you double check everything is still working well?
I think we're good to go with respect to plot-layout and occ.sh. I'm not familiar with amdgcn-cfg; a thumbs up from someone that has used it is required IMHO. I also think we can move python/perf-kernels/tune_gemm to python/perf-kernels/tools/tune_gemm in a future PR so all tools would be in the tools directory.
This is a good idea. Let's do it here.
Hi all, thanks for the scripts to visualize the layout transformations. But after I applied the padding and swizzling rules, bank conflicts still exist. Can these rules help avoid bank conflicts?
@LeiWang1999 It depends on your application's LDS access pattern. We are focusing on matmul and flash-attention kernels, in which padding can always resolve LDS bank conflicts, and so can swizzling.
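To make the padding point concrete, here is a toy model (my own illustration, not code from this repo): it treats LDS as 32 four-byte banks, has every lane of a 64-lane wavefront read one fp16 element down a single column of a 64x64 tile, and counts the worst case number of lanes that land in the same bank with and without a small pad. Real hardware resolves conflicts per 16-lane phase and the kernels use vectorized reads, so treat the exact numbers as illustrative only.

# Toy bank-conflict model: 32 LDS banks of 4 bytes, one fp16 element per lane,
# all 64 lanes reading down the same column of a 64x64 tile.
import collections

BANKS = 32          # LDS banks on CDNA GPUs
WORD_BYTES = 4      # each bank serves one 4-byte word per cycle
ELEM_BYTES = 2      # fp16

def max_lanes_per_bank(cols, pad):
    """Worst-case number of lanes mapped to the same bank for a column read."""
    counts = collections.Counter()
    for lane in range(64):                          # one wavefront
        row, col = lane, 0                          # lanes read down a single column
        addr = (row * (cols + pad) + col) * ELEM_BYTES
        counts[(addr // WORD_BYTES) % BANKS] += 1
    return max(counts.values())

print("no padding     :", max_lanes_per_bank(64, 0), "lanes per bank")  # 64, fully serialized
print("pad by 2 fp16  :", max_lanes_per_bank(64, 2), "lanes per bank")  # 2, near conflict-free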
Thanks @zhanglx13 for your response! I'm also working on matmul and flash-attention. After debugging this project, I found that the conflict is resolved only when the swizzle is applied with the LDS vector size set to 4 (4 x fp16, 64 bits).
However, since the memory access instruction allows for a maximum of 128 bits, could this result in memory waste?
Re bank conflicts: I think we should not have bank conflicts even with ds_read/write_b128 enabled. You can double-check your kernel's access pattern and narrow down the bank conflicts to see whether they come from ds_read, ds_write, or both.
Re memory waste: if you only look at a single ds_read_b128 vs ds_read_b64, then yes, ds_read_b128 is more efficient in that one address can serve 128 bits of data. However, whether it is a waste depends on whether LDS access is the bottleneck. For compute-bound kernels, LDS accesses are usually not the bottleneck, so using ds_read_b64 should not be a problem.
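As a back-of-the-envelope illustration of the b64 vs b128 trade-off (the tile and wave sizes below are assumptions for the example, not taken from any particular kernel):

# Count how many LDS read instructions one 64-lane wavefront needs to pull a
# 64x64 fp16 tile out of shared memory with 64-bit vs 128-bit per-lane reads.
TILE_BYTES = 64 * 64 * 2          # 8 KiB of fp16
LANES = 64

for name, bytes_per_lane in [("ds_read_b64", 8), ("ds_read_b128", 16)]:
    per_instr = LANES * bytes_per_lane            # bytes moved per instruction
    print(name, "->", TILE_BYTES // per_instr, "instructions per wave")
# ds_read_b64  -> 16 instructions per wave
# ds_read_b128 ->  8 instructions per wave

Whether halving the instruction count matters depends, as said above, on whether those LDS reads sit on the critical path.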
@zhanglx13, thanks for the explanation; it makes sense to me, and it's fascinating! I just have one final question: how can we determine whether our Triton kernel is using ds_read_b128 instead of ds_read_b64?
When I printed the SwizzleParams during compilation, it consistently shows vec=4 in my FP16 matmul case. Does this imply that the kernel is using ds_read_b64?
Here’s the relevant structure for reference:
struct SwizzleParams {
  int vec;       // number of contiguous elements swizzled as one vector (here 4 x fp16 = 64 bits)
  int perPhase;  // number of consecutive rows that share the same swizzle phase
  int maxPhase;  // number of distinct phases before the XOR pattern repeats
};
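For reference, the sketch below shows how I understand these three parameters to combine into a swizzled column index; it is my reading of the scheme, not code lifted from Triton itself.

# Assumed swizzle rule: rows are grouped into phases, and the column index
# (in units of `vec` elements) is XORed with the row's phase.
def swizzle(row: int, col: int, vec: int, per_phase: int, max_phase: int) -> int:
    phase = (row // per_phase) % max_phase      # rows sharing a phase use the same XOR key
    group = (col // vec) ^ phase                # permute columns in units of `vec` elements
    return group * vec + (col % vec)            # elements inside one vector stay contiguous

# Example with vec=4 (4 x fp16 = 64 bits), perPhase=1, maxPhase=8:
for row in range(4):
    print(row, [swizzle(row, c, vec=4, per_phase=1, max_phase=8) for c in range(16)])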
Thank you for your help!
When I printed the SwizzleParams during compilation, it consistently shows vec=4 in my FP16 matmul case. Does this imply that the kernel is using ds_read_b64?
Usually yes.
If you want to double-check and confirm, you can dump the assembly code with AMDGCN_ENABLE_DUMP=1 and search for ds_read_.
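If the dump is large, a tiny helper like this can summarize the read widths (the file name is just an example; save the dump printed during compilation to a file first):

# Count LDS read widths in a dumped GCN assembly file.
import collections
import re

with open("matmul_kernel.amdgcn") as f:          # hypothetical name for the saved dump
    widths = re.findall(r"ds_read\w*_b(\d+)", f.read())

print(collections.Counter(widths))               # e.g. Counter({'64': 16}) or Counter({'128': 8})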
You can also set kpack as a kernel argument: kpack=1 means b64 and kpack=2 means b128. Here is a comment explaining what kpack means: https://github.com/triton-lang/triton/blob/433037206d8870f0b82a3cd669097001084a29ed/third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp#L419
Hi, one thing still confuses me. Given that the MFMA operation for FP16 uses the shape m16n16k16, each MFMA consumes a 16x16 block of input matrix A. For this, each thread holds 4 elements in the order shown below:
If we want to utilize ds_read_b128, we would need to perform an additional shuffling step to group two 64-bit chunks together (I've marked the storage layout in the diagram below):
Did we apply another layout transformation to fully leverage ds_read_b128?
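For context, the per-lane A-operand layout assumed in this question (each lane holding 4 contiguous fp16 values along K) can be written down directly. The mapping below is my understanding of the mfma_f32_16x16x16_f16 layout on CDNA, printed as a 16x16 grid of lane:slot pairs; treat it as a sketch rather than an authoritative reference.

# Assumed A-operand ownership for mfma_f32_16x16x16_f16: lanes 0-15 hold k=0..3
# for rows 0..15, lanes 16-31 hold k=4..7, and so on.
def a_operand_owner(row: int, k: int):
    lane = (k // 4) * 16 + row     # which of the 64 lanes holds A[row][k]
    slot = k % 4                   # position inside that lane's 4-element register
    return lane, slot

for row in range(16):
    print(" ".join(f"{lane:2d}:{slot}" for lane, slot in
                   (a_operand_owner(row, k) for k in range(16))))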
I think I've got the point, thanks :)