Open fernandoFernandeSantos opened 2 years ago
Hi,
Instrumenting CUDA events should not be that expensive. I am surprised.
What is the total time beginning-to-end of the application running natively vs. running with the tool above?
I would expect pretty similar time since there is no instrumentation of CUDA functions, but if there is a huge overhead then something is not right.
Maybe we have some inefficient debug code turned on in the NVBit core.
Thanks for reporting this.
> What is the total time beginning-to-end of the application running natively vs. running with the tool above?

The total time without instrumentation is ~5 s, counting the rand time. With the instrumentation applied, the execution time is ~90 s.
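For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two runs could be timed. It assumes the tool is injected through `LD_PRELOAD` (the usual way NVBit tools are loaded) and that `simple_conv.py` and `dummy.so` sit in the current directory; the helper names `time_command`, `slowdown`, and `compare` are hypothetical.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd once and return its wall-clock duration in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def slowdown(native_s, instrumented_s):
    """Ratio of instrumented to native run time."""
    return instrumented_s / native_s


def compare(app_cmd, tool_path):
    """Time app_cmd natively, then with the tool preloaded (assumed LD_PRELOAD)."""
    native = time_command(app_cmd)
    instrumented = time_command(app_cmd, {"LD_PRELOAD": tool_path})
    return native, instrumented, slowdown(native, instrumented)


# Example call (paths are assumptions, not taken from the thread):
# compare([sys.executable, "simple_conv.py"], "./dummy.so")
```

With the numbers reported above, `slowdown(5.0, 90.0)` gives an 18x slowdown. A single run is noisy, so repeating the measurement a few times would give a more reliable figure.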
> I would expect pretty similar time since there is no instrumentation of CUDA functions, but if there is a huge overhead then something is not right.

I think the overhead comes from the number of event calls. Is it normal to have 112614 calls to a single type of event?
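As a rough sanity check on that hypothesis, the figures in this thread can be turned into a per-call cost. This is only an upper-bound sketch: it assumes the whole ~85 s difference is spent in the 112614 callbacks for that one event, which the thread does not confirm, and `per_call_overhead_ms` is a hypothetical helper.

```python
def per_call_overhead_ms(native_s, instrumented_s, n_calls):
    """Average extra time per callback in milliseconds, assuming every
    extra second is attributable to those n_calls callbacks."""
    return (instrumented_s - native_s) / n_calls * 1000.0


# Numbers reported in this thread: ~5 s native, ~90 s instrumented,
# 112614 calls to a single event.
estimate = per_call_overhead_ms(5.0, 90.0, 112614)
print(f"{estimate:.2f} ms per callback")  # prints "0.75 ms per callback"
```

Roughly 0.75 ms per callback would be a lot for a handler that only increments a counter, so the cost is probably not in the counting itself.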
cuModuleGetFunction is used by the driver when loading CUDA functions. cuDNN and other large libraries can contain thousands of functions, so that large number could be correct. We will consider whether it is possible to selectively disable event callbacks (or to register for only a subset of them) in future versions of NVBit.
Thanks @ovilla. In the meantime, do you know of any quick workaround that I could apply to at least reduce the overhead?
Hi
I am trying to instrument applications that use PyTorch. However, I'm facing some problems with the overhead that NVBit adds. I have created a simple example (simple_conv.py) below:
Then, to measure the overhead added by NVBit function calls, I created the following dummy tool (dummy.so). The makefile is based on the mov_replace tool from the NVBit repository. The tool is expected to count the number of calls for each event that NVBit instruments.
I built dummy.so with CUDA 11.3, GCC 7.5.0, and NVBit 1.5.0, and I ran it on a Titan V GPU.
I ran the tool with the following command:
The result that I got is
I see a very large number of calls for event 23 (cuModuleGetFunction), which increases NVBit's overhead considerably.
Is there a way to tell NVBit to skip some events to avoid unnecessary overhead?
Thanks