Closed LeiWang1999 closed 1 week ago
Hi @LeiWang1999. An internal ticket has been created to assist with your question. Thanks!
Hi @LeiWang1999, this is a very long and detailed discussion. In general, here is a good example of swizzling: https://leimao.github.io/blog/CUDA-Shared-Memory-Swizzling/
However, for rocWMMA the thread coordinates are not visible to the user. If we consider LDS banks as contiguous memory addresses, then we generally recommend the following:
If your LDS layout is row_major:
Storing matrix_a frags has a higher incidence of bank conflicts, especially for BlockM > 16. Recommendation: pad the LDS K dimension by +1 or more (e.g., ldlds = BlockK + 1 or more) if bank conflicts are detected.
Storing matrix_b frags has a low incidence of bank conflicts.
If your LDS layout is col_major, it is essentially the reverse:
Storing matrix_b frags has a higher incidence of bank conflicts, especially for BlockN > 16. Recommendation: pad the LDS K dimension by +1 or more (e.g., ldlds = BlockK + 1 or more) if bank conflicts are detected.
Storing matrix_a frags has a low incidence of bank conflicts.
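To illustrate why padding the leading dimension by +1 helps, here is a minimal host-side sketch in plain C++. It is not rocWMMA code and does not reproduce rocWMMA's internal thread mapping, which is not visible to the user. It assumes a simplified LDS model with 32 banks of one 4-byte word each, and an access pattern where thread `t` touches element (row `t`, column 0) of a row-major tile. The names `kNumBanks` and `maxBankConflict` are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <array>

// Assumed LDS model (an assumption, not taken from rocWMMA):
// 32 banks, each one 4-byte word wide, with consecutive words
// mapping to consecutive banks.
constexpr int kNumBanks = 32;

// Returns the worst-case number of threads that land on the same bank when
// `numThreads` threads each access one 4-byte element down column 0 of a
// row-major tile whose leading dimension is `ld` elements.
// A result of 1 means the access is conflict-free under this model.
int maxBankConflict(int numThreads, int ld)
{
    std::array<int, kNumBanks> hits{}; // per-bank access counters, zeroed
    for (int t = 0; t < numThreads; ++t)
    {
        const int wordIndex = t * ld;   // linear index of element (t, 0)
        ++hits[wordIndex % kNumBanks];  // bank that serves this word
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, with BlockK = 32, `maxBankConflict(32, 32)` reports every thread hitting the same bank, while `maxBankConflict(32, 33)` (i.e., ldlds = BlockK + 1) spreads the threads across distinct banks. Real hardware and the actual fragment mapping may behave differently, so padding is best applied only when bank conflicts are actually detected by profiling.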
Thanks @taylding-amd for your response. I'm curious to know where the relevant source code is implemented. Is it part of the rocWMMA project, or is it handled within Tensile?
@LeiWang1999 rocWMMA is a standalone project and is not related to Tensile or CK at the moment. Thread mapping is handled in the rocWMMA layout source code. This mapping applies opaquely to either global or LDS memory spaces.
Thanks @cgmillette, that helps me a lot. I mixed up rocWMMA with rocBLAS :)
Welcome! Anything else I can help with before resolving the issue?
One last question: does Tensile handle bank conflicts or data layout in the same way as rocWMMA?
The quick answer is: not exactly. Tensile uses many different layouts, but padding is definitely one of the techniques it uses to minimize LDS bank conflicts when they arise. However, you might be able to get more specific information from the Tensile team.
Thanks, that helps me a lot already!
Closed :)
Problem Description
Avoiding bank conflicts is critical for optimizing performance. Do we currently have any specific swizzling rules in CK to avoid bank conflicts?
Operating System
Ubuntu
CPU
AMD
GPU
MI 250
ROCm Version
ROCm 5.7.0