ROCm / composable_kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
https://rocm.docs.amd.com/projects/composable_kernel/en/latest/

[Discussion] Do we have/Where can we find swizzling rules in ck to avoid bank conflict? #1603

Open LeiWang1999 opened 2 weeks ago

LeiWang1999 commented 2 weeks ago

Problem Description

Avoiding bank conflicts is critical for performance. Does CK currently have any specific swizzling rules for avoiding them?

Operating System

Ubuntu 20.04

CPU

AMD

GPU

AMD Instinct MI250

Other

No response

ROCm Version

ROCm 5.7.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

LeiWang1999 commented 2 weeks ago

Or is there documentation for this transformation, so that we can reproduce it without the templates? :)

chengyupku commented 2 weeks ago

Encountered the same problem... Does anyone have a solution?

ppanchad-amd commented 2 weeks ago

Hi @LeiWang1999. Internal ticket has been created to assist with your question. Thanks!

schung-amd commented 2 weeks ago

Hi, not sure what you mean by "rules"; are you looking for general guidance on swizzling to avoid bank conflicts, or are you looking for tools to help you here?

LeiWang1999 commented 2 weeks ago

@ppanchad-amd @schung-amd, thanks for your response!

I’m looking for an affine transform expression to eliminate bank conflicts when using MFMA (16x16x16 FP16 input, FP32 accumulation in my case).

For example, CUTLASS uses an XOR-based permutation:

[Image: the XOR-based permutation used in CUTLASS]

I would like to know where Composable Kernel handles this problem. An explicit index mapping, such as lambda i, j: f(i, j), would be ideal.

When I profiled the CK GEMM example with Omniperf, it appeared conflict-free.
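
To make the question concrete, here is a minimal sketch of the kind of XOR-based mapping being asked about, together with a toy bank-conflict counter. It is not taken from CK or CUTLASS. The names `linear`, `xor_swizzle`, and `max_conflict` are hypothetical, and the memory model (32 banks, each 4 bytes wide, with one 4-byte word read per lane) is a simplifying assumption. Real MFMA operand loads are vectorized and may need a different swizzle granularity.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared-memory banks, each 4 bytes wide
TILE_ROWS = 32   # hypothetical tile height
TILE_COLS = 32   # hypothetical tile width in 4-byte words (power of two)


def linear(i, j):
    """Plain row-major word offset of logical element (i, j)."""
    return i * TILE_COLS + j


def xor_swizzle(i, j):
    """Row-major offset with the column index XOR-ed by the row index.

    Within a row this only permutes the columns, so the mapping
    stays one-to-one over the whole tile.
    """
    return i * TILE_COLS + (j ^ (i % TILE_COLS))


def max_conflict(layout, coords):
    """Largest number of distinct words that land in the same bank."""
    words_per_bank = defaultdict(set)
    for i, j in coords:
        offset = layout(i, j)
        words_per_bank[offset % NUM_BANKS].add(offset)
    return max(len(words) for words in words_per_bank.values())


# One lane per row, all reading column 5 (a strided, column-wise access).
column_access = [(lane, 5) for lane in range(32)]
# One lane per column, all reading row 5 (a contiguous, row-wise access).
row_access = [(5, lane) for lane in range(32)]

print(max_conflict(linear, column_access))       # 32: every lane hits bank 5
print(max_conflict(xor_swizzle, column_access))  # 1: lanes spread over all banks
print(max_conflict(linear, row_access))          # 1
print(max_conflict(xor_swizzle, row_access))     # 1
```

Under these assumptions, the swizzled layout removes the 32-way conflict on column-wise reads and keeps row-wise reads conflict-free. Whether CK uses this exact function is the open question in this thread.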

LeiWang1999 commented 2 weeks ago

I also noticed that Composable Kernel might use coordinate transforms to handle this, but I’m still unclear on the exact approach.

schung-amd commented 4 days ago

Hi @LeiWang1999, sorry for the delay. I'm not aware of any public-facing documentation we have for this, but I'm reaching out to the internal teams to see if we have anything (docs or guidance) at the moment and if we should produce some documentation for it.

Also noticed you opened https://github.com/ROCm/Tensile/issues/2043; I'll leave both open for now, but the answers here should apply to both and I'll update/close both once we have a satisfactory answer.

Thanks for your interest! Hopefully we'll be able to provide some guidance beyond linking source code.