ROCm / triton

Development repository for the Triton language and compiler

Move utility tools from triton-mlir to main_perf branch #635

Closed zhanglx13 closed 2 months ago

zhanglx13 commented 2 months ago

@ravil-mobile A formatter made some changes to the amdgcn-cfg tool. Can you double-check that everything is still working?

zhanglx13 commented 2 months ago

I think we're good to go with respect to plot-layout and occ.sh. I'm not familiar with amdgcn-cfg, so IMHO we need a thumbs-up from someone who has actually used it.

> I also think we can move python/perf-kernels/tune_gemm to python/perf-kernels/tools/tune_gemm in a future PR so all tools would be in the tools directory.

This is a good idea. Let's do it here.

LeiWang1999 commented 1 week ago

Hi all, thanks for the scripts to visualize the layout transformations. However, after I applied the padding and swizzling rules, bank conflicts still exist. Can these rules actually help avoid bank conflicts?

zhanglx13 commented 1 week ago

@LeiWang1999 It depends on the LDS access pattern of your application. We are focusing on matmul and flash-attention kernels, in which padding can always resolve LDS bank conflicts, and so can swizzling.
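For intuition, here is a toy model of why padding works (plain Python, not Triton; it assumes 32 LDS banks of 4 bytes each, fp16 elements, a 64-wide tile, and a 2-element pad, all illustrative numbers): without padding, threads reading down a column all map to the same bank.

BANKS = 32
BANK_BYTES = 4
ELEM_BYTES = 2  # fp16

def banks_hit(n_rows, row_stride_elems, col):
    # Set of banks touched when one thread per row reads element (row, col).
    return {((r * row_stride_elems + col) * ELEM_BYTES // BANK_BYTES) % BANKS
            for r in range(n_rows)}

print(len(banks_hit(32, 64, col=0)))      # 1: all 32 accesses hit the same bank
print(len(banks_hit(32, 64 + 2, col=0)))  # 32: the padded stride spreads them out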

LeiWang1999 commented 1 week ago

Thanks @zhanglx13 for your response! I'm also working on matmul and flash-attention kernels. After debugging this project, I found that the conflict is resolved only when swizzling is applied with the LDS vector size set to 4 (4 x fp16, 64 bits).

However, since the memory access instructions allow up to 128 bits per access, could using only 64 bits waste bandwidth?

zhanglx13 commented 1 week ago

Re bank conflicts: I think there should not be bank conflicts even with ds_read/write_b128 enabled. You can double-check your kernel's access pattern and narrow down the bank conflicts to see whether they come from ds_read, ds_write, or both.

Re memory waste: if you only compare a single ds_read_b128 against a ds_read_b64, then yes, ds_read_b128 is more efficient in that one address serves 128 bits of data. However, whether b64 is a waste depends on whether LDS access is the bottleneck. For compute-bound kernels, LDS accesses are usually not the bottleneck, so using ds_read_b64 should not be a problem.
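To put rough numbers on it (illustrative only, say a thread fetches 32 fp16 values from LDS per loop iteration):

bytes_needed = 32 * 2  # 32 fp16 elements
for name, bits in (("ds_read_b64", 64), ("ds_read_b128", 128)):
    print(name, "->", bytes_needed // (bits // 8), "instructions")
# b128 halves the instruction (and address) count; that only matters
# if LDS throughput is the bottleneck in the first place.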

LeiWang1999 commented 6 days ago

@zhanglx13 , thanks for the explanation—it makes sense to me, and it’s fascinating! I just have one final question: how can we determine whether our Triton kernel is utilizing ds_read_b128 instead of ds_read_b64?

When I printed the SwizzleParams during compilation, it consistently shows vec=4 in my FP16 matmul case. Does this imply that the kernel is using ds_read_b64?

Here’s the relevant structure for reference:

struct SwizzleParams {
  int vec;      // elements swizzled together as one contiguous group
  int perPhase; // rows that share the same swizzle phase
  int maxPhase; // number of distinct phases before the pattern repeats
};
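For context, this is how I currently read these parameters (a rough sketch of the XOR-based swizzle; the compiler's exact formula may differ):

def swizzled_offset(row, col, n_cols, vec, per_phase, max_phase):
    # Elements move in groups of `vec`; rows within `per_phase` share a phase,
    # and the phase cycles through `max_phase` values, XOR-ed into the group index.
    phase = (row // per_phase) % max_phase
    vec_idx = (col // vec) ^ phase
    return row * n_cols + vec_idx * vec + col % vec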

Thank you for your help!

zhanglx13 commented 6 days ago

> When I printed the SwizzleParams during compilation, it consistently shows vec=4 in my FP16 matmul case. Does this imply that the kernel is using ds_read_b64?

Usually yes. If you want to confirm, you can dump the assembly code with AMDGCN_ENABLE_DUMP=1 and search for ds_read_.
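For example (a sketch; matmul_kernel.py is a placeholder for whatever script launches your kernel):

import os, subprocess

# Run the launch script with the dump enabled and count LDS read widths.
env = dict(os.environ, AMDGCN_ENABLE_DUMP="1")
result = subprocess.run(["python", "matmul_kernel.py"],
                        env=env, capture_output=True, text=True)
dump = result.stdout + result.stderr
for mnemonic in ("ds_read_b64", "ds_read_b128"):
    print(mnemonic, dump.count(mnemonic))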

You can also set kpack as a kernel argument: =1 means b64 and =2 means b128. Here is a comment explaining what kpack means: https://github.com/triton-lang/triton/blob/433037206d8870f0b82a3cd669097001084a29ed/third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp#L419
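At launch time that looks roughly like this (a fragment, not a runnable script; the kernel name, arguments, and tile sizes are illustrative, and it assumes a @triton.jit matmul_kernel is already defined):

matmul_kernel[grid](
    a_ptr, b_ptr, c_ptr, M, N, K,
    BLOCK_M=128, BLOCK_N=128, BLOCK_K=64,
    matrix_instr_nonkdim=16,
    kpack=2,  # 1 -> ds_read_b64, 2 -> ds_read_b128
)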

LeiWang1999 commented 6 days ago

Hi, one thing still confuses me. Given that the FP16 MFMA operation uses the shape m16n16k16, each MFMA consumes a 16x16 block of input matrix A. For this, each thread holds 4 elements in the order shown below:

[image: per-thread element layout for the 16x16 MFMA input block]

If we want to utilize ds_read_b128, we would need an additional shuffling step to group two 64-bit chunks together (I've marked the storage layout in the diagram below):

[image: storage layout with two 64-bit chunks grouped for ds_read_b128]

Did we apply another layout transformation to fully leverage ds_read_b128?

LeiWang1999 commented 5 days ago

I think I've got the point, thanks :)