[QST] Difference between rs and ss

haeunlee99 commented 5 days ago

What is your question?

What is the difference between RS and SS? For example, how do these two files differ?

sm90_mma_array_tma_gmma_ss_warpspecialized.hpp
sm90_mma_multistage_gmma_rs_warpspecialized.hpp

Thanks!

thakkarV commented 5 days ago

RS uses Hopper MMAs that source operand A from registers. SS uses MMAs that source A from shared memory (smem).

haeunlee99 commented 4 days ago

Thank you for your response. I thought SS was used by default on Hopper. I have several follow-up questions.

Which of RS and SS is considered optimal? Is there a guide to reproducing cuBLAS-like, near-optimal performance (e.g. shared memory swizzle, thread block size, number of threads)? I saw that there are many mainloops under "cutlass/include/cutlass/gemm/collective" and I am not sure which ones would give optimal performance.
Is there also a guide I can reference to find the optimal swizzle layout?
This is a separate question, just to double-check: if I want to do an FP16 tensor core operation with FP32 accumulation, is it correct to use C = D = FP32 in the MMA and convert to FP16 in the epilogue?

Thank you again!
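
For reference, the "FP16 inputs, FP32 accumulator, convert to FP16 in the epilogue" pattern asked about above can be sketched on the CPU. This is plain C++, not CUTLASS code, and it does not use any tensor core instructions. The names `round_to_half`, `dot_fp32_accumulate` and `dot_fp16_accumulate` are hypothetical, and FP16 is only approximated by rounding to an 11-bit significand (overflow and subnormals are ignored). It is meant only to show why the accumulator type matters.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: round a float to the nearest value with an 11-bit
// significand (FP16 precision). Assumption: FP16 overflow and subnormals
// are ignored, so this is only a model of FP16 rounding.
float round_to_half(float x) {
    assert(std::isfinite(x));
    int exponent = 0;
    // x == mantissa * 2^exponent, with |mantissa| in [0.5, 1) (or 0).
    float mantissa = std::frexp(x, &exponent);
    // Keep 11 significant bits; nearbyint uses the current rounding mode,
    // which is round-to-nearest-even by default.
    mantissa = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
    return std::ldexp(mantissa, exponent);
}

// FP16 inputs, FP32 accumulator, a single conversion to FP16 at the end
// (the step an epilogue would perform).
float dot_fp32_accumulate(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size());
    float acc = 0.0f;  // accumulator kept in FP32
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += round_to_half(a[i]) * round_to_half(b[i]);
    }
    return round_to_half(acc);  // convert once on output
}

// FP16 inputs, FP16 accumulator: the running sum is rounded after every add.
float dot_fp16_accumulate(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size());
    float acc = 0.0f;  // holds an FP16-representable value at every step
    for (std::size_t i = 0; i < a.size(); ++i) {
        float product = round_to_half(a[i]) * round_to_half(b[i]);
        acc = round_to_half(acc + product);
    }
    return acc;
}
```

With two vectors of 4096 ones, `dot_fp32_accumulate` returns 4096, while `dot_fp16_accumulate` stalls at 2048, because 2049 is not representable with an 11-bit significand and rounds back to 2048 on every step.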
