Open LeiWang1999 opened 2 weeks ago
Or do we have documentation for this transformation, so that we can reproduce it without the templates? :)
Encountered the same problem... Does anyone have a solution?
Hi @LeiWang1999. An internal ticket has been created to assist with your question. Thanks!
Hi, not sure what you mean by "rules"; are you looking for general guidance on swizzling to avoid bank conflicts, or are you looking for tools to help you here?
@ppanchad-amd @schung-amd , thanks for your response!
I’m looking for an affine transform expression to eliminate bank conflicts when using MFMA (16x16x16 FP16 input, FP32 accumulation in my case).
For example, CUTLASS uses an XOR-based permutation:
I'd like to know where Composable Kernel handles this problem; an expression of the form lambda i, j: f(i, j) would be ideal.
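To make the question concrete, here is a minimal sketch of the kind of XOR-based mapping meant above, written in the lambda i, j form. It is a toy model, not the actual CUTLASS or Composable Kernel layout: the tile is assumed to be NUM_CHUNKS vector-sized chunks wide, each chunk column is assumed to map to its own group of LDS banks, and the names swizzle and max_conflict are hypothetical helpers for illustration only.

```python
from collections import Counter

NUM_CHUNKS = 8  # assumption: chunks (vector loads) per tile row; must be a power of two


def swizzle(i, j, num_chunks=NUM_CHUNKS):
    """Map a logical (row, chunk) coordinate to a physical one via XOR."""
    # For a fixed row i, j -> j ^ (i % num_chunks) is a bijection on
    # [0, num_chunks), so no two chunks in a row collide.
    return i, j ^ (i % num_chunks)


def max_conflict(coords, num_chunks=NUM_CHUNKS):
    """Worst-case count of accesses that land in the same chunk column.

    Toy model (an assumption, not a hardware spec): accesses to the same
    chunk column from different rows hit the same banks and serialize.
    """
    counts = Counter(col % num_chunks for _, col in coords)
    return max(counts.values())


# Eight threads read the same logical chunk column (j = 3) from rows 0..7,
# which is the access pattern that conflicts in a plain row-major tile.
logical = [(i, 3) for i in range(NUM_CHUNKS)]
plain_conflict = max_conflict(logical)
swizzled_conflict = max_conflict([swizzle(i, j) for i, j in logical])
print(plain_conflict, swizzled_conflict)
```

In this model the unswizzled column read is an 8-way conflict, while the swizzled read spreads the eight rows over eight distinct chunk columns. Whether CK applies this exact form through its coordinate transforms is the open question here.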
When I profiled the CK GEMM example with Omniperf, it appeared conflict-free.
I also noticed that Composable Kernel might use coordinate transforms to handle this, but I’m still unclear on the exact approach.
Hi @LeiWang1999, sorry for the delay. I'm not aware of any public-facing documentation we have for this, but I'm reaching out to the internal teams to see if we have anything (docs or guidance) at the moment and if we should produce some documentation for it.
Also noticed you opened https://github.com/ROCm/Tensile/issues/2043; I'll leave both open for now, but the answers here should apply to both and I'll update/close both once we have a satisfactory answer.
Thanks for your interest! Hopefully we'll be able to provide some guidance beyond linking source code.
Problem Description
Avoiding bank conflicts is critical for performance. Do we currently have any specific swizzling rules in CK to avoid bank conflicts?
Operating System
Ubuntu 20.04
CPU
AMD
GPU
AMD Instinct MI250
Other
No response
ROCm Version
ROCm 5.7.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response