NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[FEA] Better grid size for H100 GPU with SXM5 #1921

Open zhipeng93 opened 2 hours ago

zhipeng93 commented 2 hours ago

Is your feature request related to a problem? Please describe. When doing GEMM on Hopper, we need to decide the grid size based on the problem size, the cluster shape, and the specific Hopper architecture.

Currently, cutlass only considers the full GH100 GPU but not the H100 GPU with SXM5 [1]. They have different architectures [2]: the full GH100 die has 144 SMs (8 GPCs of 18 SMs each), while the H100 SXM5 part ships with 132 SMs enabled.

On H100 GPU with SXM5, cutlass may fail to find the optimal grid size, leading to sub-optimal performance. For example, when doing GEMM with mnk=(4096, 4096, 4096) on H100 GPU with SXM5 and cluster shape (4, 2, 1), the resulting grid size is (4, 28, 1). However, cuBLAS uses grid size (60, 2, 1) and performs about 17% better.

cutlass settings:

using TileShape           = Shape<_128,_128,_64>; 
using ClusterShape        = Shape<_4,_2,_1>;
cutlass::gemm::KernelTmaWarpSpecializedCooperative

[1] https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/tile_scheduler_params.h#L249-L259 [2] https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

Describe the solution you'd like Compute the grid size according to the GPU architecture.


zhipeng93 commented 2 hours ago

While hacking the cutlass code for better performance on H100 GPU with SXM5, I found that with cluster shape (4, 2, 1) or (2, 2, 1), only 120 SMs can be used. That is, when I set the grid size to more than 120 CTAs, the number of waves per SM becomes greater than 1.

I also checked this setting: with cluster shape (1, 1, 1) and grid size (132, 1, 1), waves per SM is 1, so all 132 SMs are usable.

Any ideas on the number of SMs that can be used when using thread block clusters?

zhipeng93 commented 2 hours ago

cc @hwu36 @Junkai-Wu

hwu36 commented 1 hour ago

@ANIKET-SHIVAM