NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Understanding `sgemm_sm80.cu` with NVIDIA Nsight Compute #1914

Open gohar94 opened 1 week ago

gohar94 commented 1 week ago

What is your question?

I am running the `sgemm_sm80.cu` example (debug build) under Nsight Compute (NCU). I want to find the part of the kernel that ends up executing the HFMA2 instruction and do some analysis on it. My understanding is that the `gemm` call in the main loop should be what compiles to HFMA2, but in NCU that line of code shows no associated SASS. Instead, the HFMA2 seems to come from copy-atom code. Is that expected? If so, why does the `gemm` call not end up as the HFMA2 instruction?

Below are two relevant screenshots: Image Image

osayamenja commented 1 week ago

Have you correlated with `real.hpp`? That file is where the basic fma is defined. I experimented with `gemm_tn` and got the result below, which I think is expected.

Image

From the above, I do not see any HFMA instructions. It seems `cute` does not dispatch to those, but rather to HADD, FADD, and HMUL.

On the other hand, I verified the HMMA SASS by experimenting with half x half = float. I get the result below, which is expected, since `cute::gemm` dispatches to this MMA instruction.

Image
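
For anyone following along, here is a rough plain C++ sketch of what a "basic fma" fallback of the kind mentioned above might look like, and how a naive GEMM loop would call it. This is only an illustration, not the CuTe source: `fma_sketch`, `gemm_sketch`, and `demo` are hypothetical names, and the flat row-major arrays (with B stored as N x K) are an assumption made to keep it short. With `float` operands, a fallback like this would normally be expected to compile to a single-precision multiply-add rather than a half-precision one, and a small function like this is likely to be inlined, so the profiler may attribute its SASS to a different source line than the call site.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a generic scalar fallback: d = a * b + c.
// Not the actual CuTe definition; written only to illustrate the idea.
template <class D, class A, class B, class C>
constexpr void fma_sketch(D& d, A const& a, B const& b, C const& c) {
    d = a * b + c;
}

// Naive GEMM sketch: C (M x N) += A (M x K) * B^T, with B stored as N x K.
// All matrices are flat, row-major arrays (an assumption for brevity).
template <std::size_t M, std::size_t N, std::size_t K>
std::array<float, M * N> gemm_sketch(std::array<float, M * K> const& A,
                                     std::array<float, N * K> const& B) {
    std::array<float, M * N> C{};  // zero-initialized accumulators
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t m = 0; m < M; ++m) {
            for (std::size_t n = 0; n < N; ++n) {
                // Every accumulation funnels through the scalar fallback.
                fma_sketch(C[m * N + n], A[m * K + k], B[n * K + k], C[m * N + n]);
            }
        }
    }
    return C;
}

// Small fixed example: a 2x3 A times a (2x3)^T B gives a 2x2 C.
inline std::array<float, 4> demo() {
    std::array<float, 6> A{1, 2, 3,
                           4, 5, 6};
    std::array<float, 6> B{1, 0, 1,
                           0, 1, 0};
    return gemm_sketch<2, 2, 3>(A, B);
}
```

Calling `demo()` from a `main` and printing the four entries is enough to check the loop. Comparing how a function like this shows up in a debug build versus an optimized build can help explain why the source view lists no SASS for a given line.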