alephchang opened this issue 4 months ago
Are you asking for a grouped gather/scatter GEMM? If so, I do not think an example exists today, but it should be easy to build one on top of the existing Ampere grouped GEMM kernel we have.
Thank you for your reply. The grouped scatter GEMM is exactly what I need. I have another question: in my scenario, different GEMMs within a single grouped GEMM may scatter into the same matrix D. For instance, D += scatter(A_0 B_0) + scatter(A_1 B_1), where both GEMMs are part of one grouped GEMM. A straightforward approach is to store the results of scatter(A_0 B_0) and scatter(A_1 B_1) in global memory and then invoke another kernel to accumulate them into matrix D. My question is: is it possible to perform the atomic accumulation into matrix D within the epilogue? Are there any related examples?
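For concreteness, here is roughly the per-element operation I have in mind, written as a standalone CUDA kernel rather than as a CUTLASS epilogue (the kernel and the names scatter_rows/ldc/ldd are only illustrative assumptions, not an existing API):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: the element-wise scatter-accumulate that an epilogue
// would need to perform so that several groups can safely write into the
// same D. All names and the row-major layouts are assumptions.
__global__ void scatter_accumulate(
    float const* C,            // m x n result of one group's GEMM (row-major)
    int const*   scatter_rows, // maps row i of C to a row of D
    float*       D,            // shared output matrix (row-major)
    int m, int n, int ldc, int ldd)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row in C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < m && j < n) {
        // atomicAdd makes concurrent scatters from different groups safe.
        atomicAdd(&D[scatter_rows[i] * ldd + j], C[i * ldc + j]);
    }
}
```

If the epilogue could issue this kind of atomicAdd instead of a plain store, the intermediate buffers and the second kernel would not be needed.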
I have a thousand tasks to run in parallel, and each task has two steps:

1. compute a GEMM, C_t = A_t * B_t;
2. scatter-accumulate C_t into the shared matrix D.
I can use the grouped GEMM in CUTLASS to do step 1 and then launch another kernel to complete step 2 for all of the tasks, but that does not look efficient enough. I think it would be better to use one thread block to handle both step 1 and step 2 for each task in parallel.
Here is my question: is there a similar example in CUTLASS, or any suggestions for this problem?
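To illustrate, here is a rough (non-CUTLASS) sketch of the fused one-block-per-task idea; the TaskDesc layout and the naive inner GEMM loop are only assumptions for the sake of the example:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-task descriptor; not a CUTLASS type.
struct TaskDesc {
    float const* A;            // m x k, row-major
    float const* B;            // k x n, row-major
    int const*   scatter_rows; // maps row i of the local result to a row of D
    int m, n, k;
};

// One thread block per task: step 1 (small GEMM) and step 2
// (scatter-accumulate into the shared D) fused in a single kernel.
__global__ void fused_task_kernel(TaskDesc const* tasks, float* D, int ldd)
{
    TaskDesc t = tasks[blockIdx.x];

    // Each thread owns a strided set of (i, j) output elements.
    for (int idx = threadIdx.x; idx < t.m * t.n; idx += blockDim.x) {
        int i = idx / t.n;
        int j = idx % t.n;

        float acc = 0.f;
        for (int kk = 0; kk < t.k; ++kk) {   // step 1: naive dot product
            acc += t.A[i * t.k + kk] * t.B[kk * t.n + j];
        }
        // Step 2: safe even when several tasks scatter into the same rows of D.
        atomicAdd(&D[t.scatter_rows[i] * ldd + j], acc);
    }
}

// Launch with one block per task, e.g.:
//   fused_task_kernel<<<num_tasks, 128>>>(d_tasks, d_D, ldd);
```

A real implementation would want a tiled, shared-memory GEMM (or CUTLASS threadblock-level building blocks) for step 1, but the scatter-accumulate in step 2 would look the same.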
Thanks