ROCm / rocWMMA

rocWMMA
https://rocm.docs.amd.com/projects/rocWMMA/
MIT License

[Discussion] Do we have/Where can we find swizzling rules in rocwmma to avoid bank conflict? #457

Closed LeiWang1999 closed 1 week ago

LeiWang1999 commented 3 weeks ago

Problem Description

Avoiding bank conflicts is critical for performance. Do we currently have any specific swizzling rules in CK to avoid bank conflicts?

Operating System

Ubuntu

CPU

AMD

GPU

MI 250

ROCm Version

ROCm 5.7.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 2 weeks ago

Hi @LeiWang1999. An internal ticket has been created to assist with your question. Thanks!

taylding-amd commented 1 week ago

Hi @LeiWang1999, this is a long and detailed topic. For a general introduction, here is a good example of swizzling: https://leimao.github.io/blog/CUDA-Shared-Memory-Swizzling/
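For readers who want the idea in a few lines, here is a minimal host-side sketch of the XOR swizzle that kind of write-up describes. It is a generic illustration, not rocWMMA code: the 32 x 32 tile of 4-byte elements, the 32-bank model, and the names `kSwizzleTileDim`, `swizzledOffset` and `maxLanesPerBankSwizzled` are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed model: a 32 x 32 tile of 4-byte elements and 32 banks of 4 bytes,
// with consecutive 4-byte words mapped to consecutive banks.
constexpr std::size_t kSwizzleTileDim = 32;

// Hypothetical helper: logical element (row, col) is stored in the same row,
// but at column (col ^ row), so one logical column is spread across banks.
inline std::size_t swizzledOffset(std::size_t row, std::size_t col)
{
    assert(row < kSwizzleTileDim && col < kSwizzleTileDim);
    return row * kSwizzleTileDim + (col ^ row);
}

// Worst-case number of lanes hitting one bank when 32 lanes each read one
// element of logical column `col`, one lane per row.
inline std::size_t maxLanesPerBankSwizzled(std::size_t col)
{
    std::size_t hits[kSwizzleTileDim] = {};
    std::size_t worst = 0;
    for (std::size_t row = 0; row < kSwizzleTileDim; ++row)
    {
        std::size_t bank = swizzledOffset(row, col) % kSwizzleTileDim;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}
```

Under these assumptions every lane of a column read lands in a different bank, without spending any extra LDS on padding. The cost is that every reader and writer of the tile has to apply the same index transform.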

However, for rocWMMA the thread coordinates are not visible to the user. If we treat LDS banks as mapping to contiguous memory addresses, then we generally recommend the following:

If your LDS layout is row_major:
Storing matrix_a frags has a higher incidence of bank conflicts, especially for BlockM > 16.
Recommendation: pad the LDS K dimension by +1 or more (e.g., ldlds = BlockK + 1 or more) if bank conflicts are detected.
Storing matrix_b frags has a low incidence of bank conflicts.

If your LDS layout is col_major, it is essentially the reverse:
Storing matrix_b frags has a higher incidence of bank conflicts, especially for BlockN > 16.
Recommendation: pad the LDS K dimension by +1 or more (e.g., ldlds = BlockK + 1 or more) if bank conflicts are detected.
Storing matrix_a frags has a low incidence of bank conflicts.
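To see why a +1 pad helps, here is a small host-side sketch that counts how many lanes land in the same bank for a strided access. It is a simplified model, not rocWMMA's actual thread mapping (which is not visible to the user): it assumes 32 banks of 4 bytes each, 4-byte elements, and one lane per row reading the same column of a row_major tile. The name `maxLanesPerBank` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Assumed LDS model (not taken from rocWMMA): 32 banks, each 4 bytes wide,
// with consecutive 4-byte words mapped to consecutive banks.
constexpr std::size_t kNumBanks = 32;

// Worst-case number of lanes that hit the same bank when `rows` lanes each
// read one 4-byte element from column `col` of a row_major tile whose
// leading dimension is `ldlds` elements.
inline std::size_t maxLanesPerBank(std::size_t rows, std::size_t ldlds, std::size_t col = 0)
{
    assert(ldlds > 0);
    std::array<std::size_t, kNumBanks> hits{};
    for (std::size_t row = 0; row < rows; ++row)
    {
        std::size_t wordIndex = row * ldlds + col; // row_major offset in 4-byte words
        ++hits[wordIndex % kNumBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model, `maxLanesPerBank(32, 32)` returns 32 (every lane hits bank 0 when ldlds = BlockK = 32), while `maxLanesPerBank(32, 33)` returns 1 (ldlds = BlockK + 1 shifts each row by one bank). The padding costs one extra element per row of LDS, so it is worth applying only when bank conflicts are actually detected.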

LeiWang1999 commented 1 week ago

Thanks @taylding-amd for your response. I'm curious to know where the relevant source code is implemented. Is it part of the rocWMMA project, or is it handled within Tensile?

cgmillette commented 1 week ago

@LeiWang1999 rocWMMA is a standalone project and is not related to Tensile or CK at the moment. Thread mapping is handled in the rocWMMA layout source code. This mapping applies opaquely to both the global and LDS memory spaces.

LeiWang1999 commented 1 week ago

Thanks @cgmillette, that helps a lot. I mixed up rocWMMA with rocBLAS :)

cgmillette commented 1 week ago

Welcome! Anything else I can help with before resolving the issue?

LeiWang1999 commented 1 week ago

One last question: does Tensile handle bank conflicts or data layout in the same way as rocWMMA?

cgmillette commented 1 week ago

The quick answer is: not exactly. Tensile uses many different layouts, but padding is definitely one of the techniques it uses to minimize LDS bank conflicts when they arise. You might be able to get more specific information from that team.

LeiWang1999 commented 1 week ago

Thanks, helps me a lot already!

Closed :)