[QST] Understanding `sgemm_sm80.cu` with NVIDIA Nsight Compute

NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

What is your question?

I am running the `sgemm_sm80.cu` example (debug build) under NVIDIA Nsight Compute (NCU). I want to find the part of the kernel that ends up executing the HFMA2 instruction and analyze it. My understanding is that the `gemm` call in the main loop should be what compiles to HFMA2, but in NCU that line shows no associated SASS code at all. Instead, the HFMA2 appears to come from copy-atom-related code. Is that expected? If so, why does the `gemm` call not end up as the HFMA2 instruction?

Below are two relevant screenshots:

[QST] Understanding `sgemm_sm80.cu` with NVIDIA Nsight Compute #1914