NVlabs / NVBit


NVBit hangs when creating Cuda Contexts in parallel (multi-gpu) #111

Open mktip opened 1 year ago

mktip commented 1 year ago

Tested on: NVBit 1.5.5 and 1.5.4

Greetings,

While experimenting with NVBit, I ran a tool I am working on against some multi-GPU code. NVBit appears to hang when multiple CUDA contexts are created in parallel.

Here is a sample application that appears to hang when instrumented by any NVBit tool on a multi-GPU system:

#include <iostream>
#include <omp.h>
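#include <cuda.h>          // driver API: cuDevicePrimaryCtxRetain
#include <cuda_runtime.h>  // runtime API: cudaGetDeviceCount, cudaSetDevice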

int main() {
    std::cout << "Started multi context in parallel" << std::endl;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

#pragma omp parallel num_threads(num_devices)
    {
        int dev_id = omp_get_thread_num();
        std::cout << "Switching to device: " << dev_id << std::endl;
#pragma omp barrier
        cudaSetDevice(dev_id);
        CUcontext ctx;
        cuDevicePrimaryCtxRetain(&ctx, dev_id); // or `cudaFree(0)` just to force the instantiation of the cuda context
#pragma omp barrier
        std::cout << "Switched to device: "  << dev_id << std::endl;
    }

    std::cout << "Ended multi context in parallel" << std::endl;

    return 0;
}

Build and run (the hang is intermittent, so reproducing it might require multiple runs):

$ nvcc  -lcuda -lcudart -Xcompiler=-fopenmp -O3 -arch=sm_75 pmcs.cu -o pmcs
$ LD_PRELOAD=./tools/noop/noop.so ./pmcs # or any other tool within ./tools could be used

Note: noop.so is a tool which does nothing. The instrumentation functions simply return when entered.
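
Because the hang does not show up on every run, a small wrapper that reruns the binary under a time limit can make it easier to reproduce. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available; the function name `run_until_hang` and the run count and time limit in the usage comment are illustrative choices, not part of NVBit:

```shell
#!/bin/sh
# Sketch: rerun a command until one run exceeds a time limit (a likely hang).
# Usage: run_until_hang <max_runs> <timeout_seconds> <command> [args...]
# `run_until_hang` is a hypothetical helper name, not an NVBit utility.
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        status=0
        # GNU `timeout` exits with status 124 when it had to stop the command.
        timeout "$limit" "$@" > /dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i: timed out after ${limit}s (possible hang)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max_runs runs"
    return 0
}

# Example invocation (values are illustrative):
#   run_until_hang 50 30 env LD_PRELOAD=./tools/noop/noop.so ./pmcs
```

Passing `LD_PRELOAD` through `env` keeps the tool from being loaded into `timeout` itself, so only the application under test is instrumented.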

I tried to find where the hang occurs. The only place I could trace it to is the cudaDeviceSynchronize call in ./core/nvbit_tool.h (line 82), but I think that is a symptom rather than the cause:

...

extern "C" void nvbit_at_context_init_hook() {
    __nvbit_start();
    load_module_nvbit_kernel<<<1, 1>>>(0);
    cudaDeviceSynchronize();  // this line
    assert(cudaGetLastError() == cudaSuccess);
}

If this is a known problem, are there any workarounds for it?