NVIDIA / NVTX

The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

How to use NVTX in device code. #83

jiangxiaobin96 closed this issue 11 months ago

jiangxiaobin96 commented 11 months ago

Can NVTX be used in device code, and if so, how?

jcohen-nvidia commented 11 months ago

Hi Xiaobin,

NVTX is not usable from device code, by design. We've thought a lot about how we could implement this, but it would inevitably cause unacceptable performance degradation. Keep in mind that CUDA kernels make best use of the GPU when they launch thousands or millions of threads. Imagine a 1-million-thread kernel with a single NVTX range, using nvtxRangePushA and nvtxRangePop, running under a trace tool like Nsight Systems. Just one kernel launch like this would produce 2 million trace records, and the Push records contain a string of unbounded length. With two 8-byte timestamps and perhaps 24 more bytes of other data (already optimistically low), that's 40 bytes per range. So for the whole kernel, that's 40 MB of trace data being generated on the GPU, and that data would have to be transferred from device memory to host memory, and then to disk -- all without harming performance. It's already challenging to keep the overhead of tracing just the CUDA kernel start & end times below 1 µs per kernel launch, and the overhead added by NVTX calls would be far worse, scaling with the number of NVTX calls times the number of threads.

This is why we think it's not a plausible solution to the problem of investigating performance within a CUDA kernel. For that problem, I recommend using Nsight Compute to analyze specific kernels. You could use Nsight Systems first to identify which kernels are the bottlenecks in your application, and then use NVTX ranges or other tricks to focus Nsight Compute's deep-dive analysis on just the problematic kernels.
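For reference, this is roughly what the supported, host-side usage looks like: a minimal sketch that pushes an NVTX range around a kernel launch from the host thread, so the tool records exactly two events no matter how many device threads run. It assumes the NVTX v3 header path (`nvtx3/nvToolsExt.h`); the kernel name `myKernel` and launch helper are hypothetical.

```cpp
// Minimal sketch: host-side NVTX annotation around a kernel launch.
#include <nvtx3/nvToolsExt.h>
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchAnnotated(float* d_data, int n)
{
    // The range is pushed/popped on the host thread, not inside the kernel,
    // so Nsight Systems sees one Push and one Pop regardless of thread count.
    nvtxRangePushA("myKernel launch");
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    nvtxRangePop();
}
```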

If you really do want to track the progress of the many device-code threads arriving at a certain line of code during a kernel's execution, there is a way to do that -- the "pmevent" instruction, which is callable using inline PTX assembly: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-pmevent This instruction increments a GPU hardware counter any time a CUDA device-code thread issues it. If you place this instruction at an important point in your device function, you'd expect the hardware counter to go from zero at kernel start up to the total number of threads once they've all passed that line. You can then use tools like Nsight Systems, Nsight Compute, or the PerfWorks library to sample these counters and graph them over time, so you can watch the kernel's progress while it runs. The effect on kernel performance should be insignificant with this approach, even with huge numbers of pmevents. This approach to device-code perf analysis isn't one I'd recommend, though... I encourage you to use Nsight Compute, learn what its reported metrics mean, and learn how to change your code to avoid the perf issues it highlights -- the documentation will help you understand the inner workings of the GPU and how to write efficient device code.
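As a rough illustration of the pmevent approach, here is a sketch of issuing the instruction from device code via inline PTX. The wrapper name `markProgress`, the kernel, and the choice of event index 0 are all arbitrary assumptions for the example; the index you pass must match the counter you sample in the tool.

```cpp
// Sketch: issuing the pmevent PTX instruction from device code.
__device__ __forceinline__ void markProgress()
{
    // Triggers performance-monitor event 0; valid immediates are 0-15.
    asm volatile("pmevent 0;");
}

__global__ void myKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    out[i] = in[i] * in[i];

    // Each thread that reaches this line bumps the counter once, so sampling
    // the counter over time shows it ramp from 0 up to n as threads pass here.
    markProgress();
}
```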

I tried to briefly describe this in the NVTX docs here: https://github.com/NVIDIA/NVTX#which-platforms-does-nvtx-support ...but I should probably pull this topic out to be its own section, because this is a request we get often. And when describing pmevent, I should also make sure all relevant tools have documentation for how to collect & display that counter, and then provide links from the NVTX docs to the tool-specific docs for that.

jiangxiaobin96 commented 11 months ago

Greatly appreciated!