NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] How to use one thread block to process one matrix multiplication? #1617

Open alephchang opened 4 months ago

alephchang commented 4 months ago

I have a thousand tasks to run in parallel, and each task has two steps:

  1. Matrix multiplication: C[i] = A[i] * B[i]. The matrix sizes are non-uniform, with (m, n, k) in the range 10 to 1024.
  2. Some operation on C[i], such as scattering C[i] into another matrix D[i].

I can use grouped GEMM in CUTLASS for step 1 and then launch a separate kernel to complete step 2 for all of the tasks, but this does not look efficient enough. I think it would be better to use one thread block to do both steps 1 and 2 for each task in parallel.
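
For concreteness, here is a rough sketch of the fused pattern I have in mind: one thread block handles one task, doing a naive GEMM followed by the scatter. This is only an illustration of the data flow (all names here, such as `TaskDesc` and `row_index`, are made up), not a CUTLASS kernel:

```cuda
#include <cuda_runtime.h>

struct TaskDesc {
  const float* A;         // m x k, row-major
  const float* B;         // k x n, row-major
  float*       D;         // destination matrix for the scatter
  const int*   row_index; // maps row i of C to a row of D (hypothetical scatter rule)
  int m, n, k;
  int ldd;                // leading dimension of D
};

__global__ void fused_gemm_scatter(const TaskDesc* tasks) {
  const TaskDesc t = tasks[blockIdx.x];   // one thread block per task

  // Each thread strides over the m x n output elements of its task.
  for (int idx = threadIdx.x; idx < t.m * t.n; idx += blockDim.x) {
    int row = idx / t.n;
    int col = idx % t.n;

    // Step 1: naive dot product for C[row, col].
    float acc = 0.f;
    for (int kk = 0; kk < t.k; ++kk) {
      acc += t.A[row * t.k + kk] * t.B[kk * t.n + col];
    }

    // Step 2: scatter the element into D according to the per-task index map.
    t.D[t.row_index[row] * t.ldd + col] = acc;
  }
}

// Launch: one block per task.
// fused_gemm_scatter<<<num_tasks, 256>>>(tasks_device);
```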

Here is my question: is there a similar example in CUTLASS, or any suggestions for this problem?

Thanks

thakkarV commented 3 months ago

Are you asking for a grouped gather/scatter GEMM? If so, I do not think an example exists today, but it should be easy to build one on top of the existing Ampere grouped GEMM kernel we have.
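
As a rough sketch, such a kernel would take the same per-group arrays that the existing grouped GEMM example (examples/24_gemm_grouped) already consumes (per-group problem sizes, pointers, and leading dimensions), plus one extra per-group index array for the scatter. The struct below only illustrates that metadata; `scatter_index` and the type names are assumptions, not an existing CUTLASS API:

```cuda
#include <cstdint>
#include <vector>
#include <cutlass/gemm_coord.h>

// Hypothetical per-group descriptor for a grouped scatter GEMM, modeled on the
// arrays example 24 passes to the grouped kernel.
struct ScatterGemmGroup {
  cutlass::gemm::GemmCoord problem_size;  // (m, n, k) for this group, 10..1024
  float const* ptr_A;
  float const* ptr_B;
  float*       ptr_D;           // matrix the epilogue scatters into
  int const*   scatter_index;   // hypothetical per-row destination map
  int64_t lda, ldb, ldd;
};

// Host side: build one descriptor per task and copy the array to the device,
// mirroring how example 24 hands over its problem_size / pointer / stride arrays.
std::vector<ScatterGemmGroup> groups;  // filled with one entry per task
```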

github-actions[bot] commented 2 months ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

alephchang commented 2 months ago

> Are you asking for a grouped gather/scatter GEMM? If so, I do not think an example exists today, but it should be easy to build one on top of the existing Ampere grouped GEMM kernel we have.

Thank you for your reply. A grouped scatter GEMM is exactly what I need. I have another question: in my scenario, different GEMM operations within a grouped GEMM may scatter into the same matrix D. For instance, D += scatter(A_0 B_0) + scatter(A_1 B_1), where both GEMMs are part of a single grouped GEMM. A straightforward approach is to store the results of scatter(A_0 B_0) and scatter(A_1 B_1) in global memory and then invoke another kernel to accumulate them into D. My question is: is it possible to perform the atomic accumulation on D within an epilogue function? Are there any related examples?
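
To make the intent concrete, the accumulation I have in mind is the following, shown here as a standalone scatter kernel using atomicAdd rather than as a CUTLASS epilogue (how to hook this into the epilogue is exactly what I am asking about; all names below are hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void scatter_accumulate(
    const float* __restrict__ C,        // m x n result of one group's GEMM
    float* __restrict__ D,              // shared destination matrix
    const int* __restrict__ row_index,  // maps rows of C to rows of D
    int m, int n, int ldd) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < m * n) {
    int row = idx / n;
    int col = idx % n;
    // Atomic accumulation: D[row_index[row], col] += C[row, col].
    // Several groups may target the same element of D, so a plain store would
    // lose updates; atomicAdd keeps the accumulation race-free.
    atomicAdd(&D[row_index[row] * ldd + col], C[row * n + col]);
  }
}
```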