NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST/BUG] Should shared memory usage be checked for multistage pipeline? #1525

Closed: wzhcz8902 closed this issue 1 month ago

wzhcz8902 commented 4 months ago

https://github.com/NVIDIA/cutlass/blob/033d9efd2db0bbbcf3b3b0650acde6c472f3948e/include/cutlass/gemm/kernel/gemm.h#L153-L199

For a multistage pipeline, shared memory usage is proportional to the number of stages, so there is a maximum stage count beyond which the kernel fails to run. I checked the can_implement function, and it seems to only care about the alignment of the tensor addresses in global memory. Should shared memory usage be checked as well? And why is it important to make sure the global addresses are aligned?
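
For a concrete sense of the scaling, here is a hedged host-side sketch of the kind of check being asked about. The tile shape, element type, and stage count below are made up for illustration and are not taken from any particular CUTLASS kernel:

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Illustrative numbers only: a 128x128x32 threadblock tile with half-precision
// A and B operands. Per-stage footprint = A tile + B tile.
constexpr int kTileM = 128, kTileN = 128, kTileK = 32;
constexpr int kStages = 5;  // hypothetical stage count
constexpr std::size_t kBytesPerStage =
    (kTileM * kTileK + kTileK * kTileN) * sizeof(__half);
constexpr std::size_t kSmemBytes = kStages * kBytesPerStage;  // grows linearly with stages

int main() {
  int device = 0, smem_optin = 0;
  cudaGetDevice(&device);
  // Largest amount of shared memory a kernel can opt in to on this device.
  cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  std::printf("pipeline needs %zu bytes of shared memory; device allows %d bytes per block\n",
              kSmemBytes, smem_optin);
  if (kSmemBytes > static_cast<std::size_t>(smem_optin)) {
    std::printf("too many stages for this tile shape on this GPU\n");
  }
  return 0;
}
```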

hwu36 commented 2 months ago

> Should shared memory usage be checked?

It is nice to have, but not critical. Shared memory usage is decided statically: the tile shape and stage count are template parameters, so an invalid combination should simply not be instantiated. The can_implement function mostly checks runtime values, such as the alignment of the tensor pointers.
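
To make the "decided statically" point concrete, here is a minimal compile-time sketch, assuming a kernel type that exposes a SharedStorage struct as the kernel in gemm.h does; the wrapper struct and the 99 KB budget are illustrative, not part of CUTLASS:

```cpp
#include <cstddef>

// Because the tile shape and stage count are template parameters,
// sizeof(SharedStorage) is a compile-time constant, so an oversized
// configuration can be rejected before anything runs.
template <typename GemmKernel>
struct SmemStaticCheck {
  static constexpr std::size_t kSmemBytes =
      sizeof(typename GemmKernel::SharedStorage);
  // 99 KB is an assumed budget; the real cap depends on the architecture
  // and on whether the launch opts in to extra shared memory.
  static_assert(kSmemBytes <= 99 * 1024,
                "tile shape / stage count exceeds the assumed shared memory budget");
};
```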

> Why is it important to make sure the global address is aligned?

An unaligned address will cause an illegal memory access failure.
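
The underlying reason is that CUTLASS moves data with vectorized global accesses (for example 128-bit loads), and those instructions fault unless the address is a multiple of the access width. A hedged sketch of the style of pointer test this implies; the 16-byte value is illustrative, since the real requirement depends on the element type and the configured access width:

```cpp
#include <cstdint>

// 16 bytes corresponds to 128-bit vectorized accesses (illustrative value).
constexpr std::uintptr_t kAlignmentBytes = 16;

// Returns true if the raw pointer is a multiple of the assumed access width.
inline bool is_aligned(void const* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kAlignmentBytes == 0;
}
```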