NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Split-k in hopper gather scatter gemm #1798

Open susavlsh10 opened 2 months ago

susavlsh10 commented 2 months ago

What is your question? How to use split-k in hopper gather scatter gemm?

I am trying to optimize a skinny gemm on Hopper with the gather scatter kernel (adapting example 52_hopper_gather_scatter_fusion). The default kernel does not perform well across the range of index_size values I care about. I have the following two use cases.

use case 1: M x N x K = 128 x 32768 x 8192; gather from N, index_size = 8192-16384

use case 2: M x N x K = 128 x 8192 x 32768; gather from K, index_size = 8192-16384
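In code, the shapes look roughly like this (illustrative sketch only; the index buffer here is just a placeholder):

  // Use case 1 as the kernel sees it: gather along N, so only index_size of
  // the 32768 columns are actually visited.
  #include <vector>

  int M = 128, N = 32768, K = 8192;
  int index_size = 16384;                       // anywhere in 8192-16384 in practice
  std::vector<int> gather_indices(index_size);  // indices into [0, N)
  // Effective problem to tile: M x index_size x K, so the gemm stays skinny in M.
  // (Use case 2 is analogous but gathers along K, giving M x N x index_size.)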

The profiler suggests that a split-k factor of 1-4 is usually enough to maximize the performance of skinny gemms (the best split depends on the problem size). This approach worked well on Ampere when adapting example 36_gather_scatter_fusion.
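For reference, on Ampere the split was requested through the standard 2.x universal arguments; a minimal sketch, assuming Gemm is the gather/scatter kernel type assembled in example 36, that its arguments expose the usual mode / problem_size / batch_count members, and that the remaining pointers, strides, and index buffers are filled in exactly as in that example:

  // CUTLASS 2.x split-k (what worked on Ampere): run in kGemm mode and put
  // the split count in the batch_count slot of the universal arguments.
  typename Gemm::Arguments arguments;            // pointers/strides/gather indices as in example 36
  arguments.mode         = cutlass::gemm::GemmUniversalMode::kGemm;
  arguments.problem_size = {M, index_size, K};   // effective skinny problem
  arguments.batch_count  = 4;                    // reused as split_k_slices in kGemm mode; 1-4 was enough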

Following https://github.com/NVIDIA/cutlass/issues/1079, I tried to enable split-k on Hopper by using cutlass::gemm::StreamKScheduler as the tile scheduler and setting the splits argument to a value larger than 1. This did not work with the gather scatter Hopper kernel in example 52; I hit the following static assertion in gather_gemm.hpp:

  static_assert(cute::is_void_v<TileScheduler_> or cute::is_same_v<TileScheduler_, PersistentScheduler>,
    "Non-persistent warp-specialized kernel does not support specializing the tile scheduler.");

I removed the assertion and the code did compile, but the performance did not improve as expected.
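For reference, the change I attempted mirrors what #1079 describes for the stock Hopper kernel; a minimal sketch below, where ProblemShape, CollectiveMainloop, and CollectiveEpilogue are the types already built in example 52, and the assumption is that the gather kernel in gather_gemm.hpp takes the tile scheduler in the same template slot and exposes a scheduler.splits argument the same way the plain GemmUniversal kernel does:

  // Sketch: select the stream-K scheduler instead of the default
  // (void -> non-persistent) scheduler. For the plain Hopper kernel this is:
  using TileScheduler = cutlass::gemm::StreamKScheduler;

  using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
      ProblemShape,
      CollectiveMainloop,      // collectives as built in example 52
      CollectiveEpilogue,
      TileScheduler>;          // previously void (non-persistent) in the gather example

  using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

  // Fill the arguments as in example 52 (problem shape, pointers, gather
  // functors), then request an explicit split count from the scheduler.
  // Assumption: the stream-K scheduler arguments expose a `splits` field.
  typename Gemm::Arguments arguments{};
  arguments.scheduler.splits = 4;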

Can we use split-k in hopper with the gather scatter gemm kernel? If so, how do I do this and what needs to be changed? Any help on this matter is greatly appreciated.

github-actions[bot] commented 1 month ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.