NVlabs / NVBit


Avoid instrumenting some events #79

Open fernandoFernandeSantos opened 2 years ago

fernandoFernandeSantos commented 2 years ago

Hi

I am trying to instrument applications that use PyTorch. However, I'm running into problems with the overhead that NVBit adds. I have created a simple example (simple_conv.py) below:

#!/usr/bin/python3.8
import time
import torch

conv_2d = torch.nn.Conv2d(in_channels=384, out_channels=256, kernel_size=3, stride=1, padding=1)
input_tensor = torch.randn(1, 384, 256, 9)
tic = time.time()
input_tensor = input_tensor.to(device="cuda")
conv_2d = conv_2d.to(device="cuda")
toc = time.time() - tic
print("GPU copy time", toc)
tic = time.time()
output = conv_2d(input_tensor)
toc = time.time() - tic
print("GPU time", toc)
print(output.to("cpu").sum())
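(As an aside: CUDA kernel launches are asynchronous, so the `toc` around `conv_2d(input_tensor)` mostly measures launch latency unless the GPU is synchronized first. A minimal sketch of a timing helper that takes an optional synchronization callback — the `sync` parameter is illustrative, with PyTorch one would pass `torch.cuda.synchronize`:)

```python
import time

def timed(fn, sync=None):
    """Time fn(); call sync() before and after so pending
    asynchronous GPU work is included in the measurement."""
    if sync is not None:
        sync()  # drain work queued before the region of interest
    tic = time.time()
    result = fn()
    if sync is not None:
        sync()  # wait for fn's GPU work to actually finish
    return result, time.time() - tic
```
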

Then, to measure the overhead added by NVBit's event callbacks, I created the following dummy tool (dummy.so). The Makefile is based on the mov_replace tool from the NVBit repository. The tool simply counts the number of calls for each event that NVBit instruments.

#include <cstdint>
#include <iostream>
#include <vector>

#include "nvbit_tool.h"
#include "nvbit.h"

/* One counter per CUDA driver API callback id (cbid) */
std::vector<uint32_t> event_counter;

void nvbit_at_init() {
    event_counter = std::vector<uint32_t>(650, 0); // 650 is from tools_cuda_api_meta.h
}

void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus) {
    event_counter[cbid]++;
}

void nvbit_at_term() {
    for (size_t k = 0; k < event_counter.size(); k++)
        if (event_counter[k] != 0)
            std::cout << "Event code:" << k << " Event counter:" << event_counter[k] << std::endl;
}
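(Note: nvbit_at_cuda_event is invoked both on entry and on exit of each driver API call, distinguished by the `is_exit` flag, so the counters below count most API calls twice. A self-contained sketch of counting only the entry half — NVBit types replaced with plain ints for illustration, this is not the NVBit API itself:)

```cpp
#include <cstdint>
#include <vector>

// Plain-int stand-ins for the NVBit callback arguments, for illustration.
static std::vector<uint32_t> entry_counter(650, 0);

// Counts only the entry half of each entry/exit callback pair,
// so each driver API call is counted once.
void on_cuda_event(int is_exit, int cbid) {
    if (is_exit) return;  // skip the exit callback of the pair
    entry_counter[cbid]++;
}
```

With entry-only counting, a counter of 112614 for cbid 23 would correspond to roughly half that many actual cuModuleGetFunction calls.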

I built dummy.so with CUDA 11.3, GCC 7.5.0, and NVBit 1.5.0, and ran it on a Titan V GPU.

I run the tool with the following command:

eval CUDA_INJECTION64_PATH=<path to dummy>/dummy.so ./simple_conv.py 

The result I got is:

------------- NVBit (NVidia Binary Instrumentation Tool v1.5.5) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
GPU copy time 88.82273077964783
GPU time 0.0409548282623291
tensor(487.4692, grad_fn=<SumBackward0>)
Event code:9 Event counter:74
Event code:16 Event counter:98
Event code:23 Event counter:112614
Event code:39 Event counter:2
Event code:118 Event counter:60
Event code:119 Event counter:2
Event code:124 Event counter:16
Event code:126 Event counter:8
Event code:216 Event counter:8
Event code:241 Event counter:4604
Event code:243 Event counter:18
Event code:245 Event counter:6
Event code:247 Event counter:2
Event code:277 Event counter:6
Event code:279 Event counter:2
Event code:304 Event counter:2
Event code:307 Event counter:6
Event code:367 Event counter:8
Event code:370 Event counter:2
Event code:386 Event counter:1
Event code:499 Event counter:4

I see a lot of calls for event 23 (cuModuleGetFunction), which increases NVBit's overhead by a lot.

Is there a way to tell NVBIT to avoid instrumenting some events to prevent unnecessary overhead?

Thanks

ovilla commented 2 years ago

Hi,

Instrumenting CUDA events should not be that expensive. I am surprised.

What is the total time beginning-to-end of the application running natively vs. running with the tool above?

I would expect pretty similar time since there is no instrumentation of CUDA functions, but if there is a huge overhead then something is not right.

Maybe we have some inefficient debug code turned on in the NVBit core.

Thanks for reporting this.

fernandoFernandeSantos commented 2 years ago

> What is the total time beginning-to-end of the application running natively vs. running with the tool above?

The total time without instrumentation is ~5 s, including the random tensor initialization time. When instrumentation is applied, the execution time is ~90 s.

> I would expect pretty similar time since there is no instrumentation of CUDA functions, but if there is a huge overhead then something is not right.

I think the overhead comes from the number of event callbacks. Is it normal to have 112614 calls to a single event type?

ovilla commented 2 years ago

cuModuleGetFunction is used by the driver when loading CUDA functions. cuDNN and other large libraries can contain thousands of functions, so that large number could be correct. We will consider whether it is possible to selectively disable event callbacks (or to register for only a subset of them) in future versions of NVBit.

fernandoFernandeSantos commented 2 years ago

Thanks @ovilla. In the meantime, do you know of any quick solution I can apply to at least reduce the overhead?
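(One tool-side mitigation — a sketch, not an NVBit feature — is to early-return in the callback for the high-frequency cbids, e.g. code 23 (cuModuleGetFunction) and 241 from the output above. Note this only trims work done by the tool itself; it cannot stop NVBit core from intercepting the event, so it may recover little of the ~90 s:)

```cpp
#include <unordered_set>

// cbids observed to be very frequent in the trace earlier in this thread;
// 23 is cuModuleGetFunction. The values here are taken from that output.
static const std::unordered_set<int> noisy_cbids = {23, 241};

// Returns true if the tool should skip all work for this cbid.
bool should_skip(int cbid) {
    return noisy_cbids.count(cbid) > 0;
}

// Inside nvbit_at_cuda_event one would then do:
//   if (should_skip(cbid)) return;
```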