NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Question about SMs allocation and Persistent Cooperative kernel design. #1938

Closed: Jacfger closed this issue 17 hours ago

Jacfger commented 1 day ago

What is your question?

From my understanding (after searching the issues around this), the persistent cooperative kernel design is to have the kernel occupy as many SMs as possible, a count that is deduced at runtime via https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/kernel_hardware_info.h, and I'm currently trying to build on top of this idea. However, when considering synchronization between blocks, I came across this post https://forums.developer.nvidia.com/t/fixing-sms-for-a-kernel/44619/6, which implies that even though we know how many SMs are available, the kernel is not necessarily guaranteed to be launched with that many SMs resident. Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many SMs as we hoped? Since there's no issue report about this problem, can someone explain why it doesn't happen in practice? I don't think the response in that forum post is wrong either.
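(For context, a minimal sketch of the runtime SM-count query in question, using the plain CUDA runtime API that kernel_hardware_info.h wraps; as I understand it, CUTLASS's KernelHardwareInfo exposes the same value through its sm_count field. This is my own illustration, not CUTLASS code.)

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int query_sm_count(int device_id) {
  int sm_count = 0;
  // Number of multiprocessors physically present on the device. This is an
  // upper bound on concurrently resident CTAs for a 1-CTA-per-SM launch; it
  // is NOT a guarantee that a given launch actually gets all of them.
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device_id);
  return sm_count;
}

int main() {
  printf("SMs available: %d\n", query_sm_count(0));
  return 0;
}
```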

thakkarV commented 1 day ago

Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many SMs as we hoped? Since there's no issue report about this problem, can someone explain why it doesn't happen in practice?

The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model. If you add a barrier in there, that is no longer a legal CUDA program, but it will likely work in practice if you can ensure the launched kernel has exclusive access to the SMs on the chip.
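To make the distinction concrete, here is a minimal sketch of the legal pattern (illustrative names like g_tile_counter and process_tile are mine, not CUTLASS APIs): a persistent kernel whose CTAs claim tiles through an atomic counter, so correctness holds no matter how many CTAs the scheduler actually makes resident. Even a single resident CTA would eventually drain all the work. A grid-wide barrier, by contrast, assumes every CTA is co-resident; with a plain launch, resident CTAs could spin forever waiting on CTAs that cannot start until they exit. The programming model's sanctioned route to that co-residency guarantee is a cooperative launch via cudaLaunchCooperativeKernel, which fails at launch time if the grid cannot be fully resident.

```cpp
#include <cuda_runtime.h>

// Next tile index to claim. Must be reset to 0 before each launch,
// e.g. with cudaMemcpyToSymbol.
__device__ unsigned int g_tile_counter = 0;

__device__ void process_tile(unsigned int tile) {
  (void)tile;  // placeholder: compute one output tile here
}

// Legal persistent kernel: each CTA claims tiles with an atomic counter and
// exits when the work runs out. Correctness never depends on how many CTAs
// are resident at once.
__global__ void persistent_gemm(unsigned int num_tiles) {
  __shared__ unsigned int tile;
  while (true) {
    if (threadIdx.x == 0) {
      tile = atomicAdd(&g_tile_counter, 1u);
    }
    __syncthreads();           // all threads see the freshly claimed tile
    if (tile >= num_tiles) {
      return;                  // no work left; this CTA retires its SM
    }
    process_tile(tile);
    __syncthreads();           // done with 'tile' before thread 0 overwrites it
  }
}
```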

Jacfger commented 17 hours ago

The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model.

Ah I see, so it does matter, and it depends on the algorithm's design. Thanks.