accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Note on Nvbit tracer #223

Open mahmoodn opened 1 year ago

mahmoodn commented 1 year ago

Hi. Based on my observations, the kernel invocation order seen by NVBit may differ from the one reported by Nsight Compute. Technically, there is no guarantee that both tools start streams in exactly the same way. For example, depending on available resources, NVBit may start streams 1, 2, 3 while Nsight Compute starts streams 1, 3, 2. This becomes a problem when I want to trace a specific kernel ID taken from Nsight Compute output. For example, Nsight Compute reports kernel ID 100 with the name FOO, but when I use NVBit to create the trace for ID 100, I see a different kernel name. I have seen this with some MLPerf and other complex workloads.
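One possible workaround for the mismatch described above is to identify a kernel by its name plus its occurrence count for that name, instead of by its global ID. The snippet below is only a sketch under stated assumptions: the two name lists are hypothetical stand-ins for kernel names exported from each tool, the helper names are made up for illustration, and it assumes both tools report identical kernel names and that same-named kernels launch in the same relative order in both runs.

```python
from collections import defaultdict


def index_by_name_occurrence(kernel_names):
    """Map (name, n-th occurrence of that name) -> 1-based global launch position."""
    seen = defaultdict(int)
    index = {}
    for position, name in enumerate(kernel_names, start=1):
        seen[name] += 1
        index[(name, seen[name])] = position
    return index


def translate_kernel_id(profiler_names, tracer_names, profiler_id):
    """Return the tracer-side ID for the kernel the profiler lists at profiler_id.

    Returns None when the tracer run has no matching (name, occurrence) pair,
    for example if the two runs selected different kernels.
    """
    name = profiler_names[profiler_id - 1]
    # Which occurrence of this name is it, counting up to and including profiler_id?
    occurrence = profiler_names[:profiler_id].count(name)
    return index_by_name_occurrence(tracer_names).get((name, occurrence))


# Hypothetical launch orders (not real tool output).
ncu_order = ["FOO", "BAR", "FOO", "BAZ"]
nvbit_order = ["FOO", "FOO", "BAR", "BAZ"]

# Profiler ID 3 is the second FOO; in the tracer order that kernel has ID 2.
tracer_id = translate_kernel_id(ncu_order, nvbit_order, 3)
```

A `None` result is a useful signal in itself: it suggests the two runs did not launch the same set of kernels, so no ID translation can be trusted for that kernel.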

I think PKA also needs to know which kernel ID to trace, and that ID comes from Nsight Compute. Have you seen this before? I know this is not directly related to Accel-Sim. I just want to share my observation and see whether I am the only one who encounters this problem.

cesar-avalos3 commented 1 year ago

We saw reordering with respect to Nsight Systems, which we ultimately solved by using the CUDA API order in Nsight Systems instead of the HW order. I don't recall mismatches with respect to Nsight Compute, though. However, we were using only the first kernel in each group, so we didn't trace far into the workload. Based on other work I am doing at the moment, I can see this happening, and I'm working towards a possible solution.
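To illustrate the difference between the two orderings, here is a rough sketch, not the actual script we used. The `Launch` record and its field names are hypothetical and do not correspond to any Nsight Systems export schema; they only stand for "when the host issued the launch" and "when the kernel started on the GPU".

```python
from dataclasses import dataclass


@dataclass
class Launch:
    name: str
    stream: int
    api_start_ns: int  # hypothetical: time the host-side launch call was issued
    gpu_start_ns: int  # hypothetical: time the kernel began executing on the GPU


def number_kernels(launches, key):
    """Assign 1-based kernel IDs after sorting the launches by the given timestamp."""
    ordered = sorted(launches, key=key)
    return {kernel_id: launch.name for kernel_id, launch in enumerate(ordered, start=1)}


# Made-up example: B is issued second on the host, but it sits on a stream
# that is still busy, so it starts on the GPU after C.
launches = [
    Launch("A", stream=1, api_start_ns=100, gpu_start_ns=150),
    Launch("B", stream=2, api_start_ns=200, gpu_start_ns=900),
    Launch("C", stream=3, api_start_ns=300, gpu_start_ns=400),
]

api_ids = number_kernels(launches, key=lambda l: l.api_start_ns)  # A, B, C
hw_ids = number_kernels(launches, key=lambda l: l.gpu_start_ns)   # A, C, B
```

With concurrent streams the two numberings can disagree, as for kernels B and C here, so an ID is only comparable across tools when both sides number kernels by the same ordering.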

We did observe that the latency introduced by the tracer tool caused cuDNN to behave non-deterministically, using a different set of kernels than when running under the profiler. We made a note of it in the paper. I believe this was solved via some environment variable.

A related, but not identical, issue the team observed with NVBit is a mismatch between the stream IDs reported by Nsys/Ncu and those reported by NVBit itself: https://github.com/NVlabs/NVBit/issues/85. This is (hopefully) partly solved in CUDA 12.0, which implements the cuStreamGetId Driver API call.