NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Difference between rs and ss #1825

Open haeunlee99 opened 5 days ago

haeunlee99 commented 5 days ago

What is your question?

What is the difference between RS and SS? For example, how do these two files differ?

sm90_mma_array_tma_gmma_ss_warpspecialized.hpp
sm90_mma_multistage_gmma_rs_warpspecialized.hpp

Thanks!

thakkarV commented 5 days ago

RS uses Hopper MMAs that source operand A from registers. SS uses MMAs that source A from shared memory (smem).
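As a rough illustration (not part of the original reply), the same distinction shows up in the CuTe MMA atom names that the SM90 mainloops are built on. The sketch below assumes a CUTLASS 3.x checkout on the include path and picks the 64x64x16 FP16-input, FP32-accumulate atoms purely as an example. The alias names `AtomSS` and `AtomRS` are hypothetical. In both variants operand B is passed as a shared-memory descriptor; only the source of A changes.

```cpp
// Sketch only: requires the CUTLASS/CuTe headers. Atom names are taken from
// cute/arch/mma_sm90_gmma.hpp; the aliases AtomSS / AtomRS are made up here.
#include <cstdint>
#include <type_traits>
#include <cute/arch/mma_sm90_gmma.hpp>

// SS: A and B are both read from shared memory via 64-bit matrix descriptors.
using AtomSS = cute::SM90_64x64x16_F32F16F16_SS<cute::GMMA::Major::K,
                                                cute::GMMA::Major::K>;

// RS: A is read from registers, B still comes from shared memory.
// The RS atoms require A to be K-major.
using AtomRS = cute::SM90_64x64x16_F32F16F16_RS<cute::GMMA::Major::K,
                                                cute::GMMA::Major::K>;

// The operand storage types reflect where each operand lives.
static_assert(std::is_same_v<AtomSS::ARegisters, uint64_t[1]>,
              "SS: A is one smem descriptor");
static_assert(std::is_same_v<AtomRS::ARegisters, uint32_t[4]>,
              "RS: A is a register fragment");
static_assert(std::is_same_v<AtomRS::BRegisters, uint64_t[1]>,
              "RS: B is still one smem descriptor");
```

The `_ss_` and `_rs_` parts of the mainloop file names follow the same convention.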

haeunlee99 commented 4 days ago

Thank you for your response. I thought SS was used by default on Hopper. I have several follow-up questions.

  1. Which of RS and SS is considered optimal? Is there a guide for reproducing cuBLAS-like, near-optimal performance (e.g. shared memory swizzle, thread block size, number of threads)? I saw that there are a lot of mainloops under "cutlass/include/cutlass/gemm/collective", and I am not sure which one would give the best performance.

  2. Is there also a guide I can reference to find the optimal swizzle layout?

  3. This is a separate question, just to double-check. If I want to do an FP16 tensor core operation with FP32 accumulation, is it correct to use C = D = FP32 in the MMA and convert to FP16 in the epilogue?

Thank you again!