While playing around with NVBit, I tried a tool I was working on with some multi-GPU code. NVBit seems to hang when multiple CUDA contexts are created in parallel.
Here is a sample application that seems to hang when instrumented by any NVBit tool on a multi-GPU system:
#include <iostream>
#include <omp.h>
#include <cuda.h>          // driver API: CUcontext, cuDevicePrimaryCtxRetain
#include <cuda_runtime.h>  // runtime API: cudaGetDeviceCount, cudaSetDevice

int main() {
    std::cout << "Started multi context in parallel" << std::endl;
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    // One OpenMP thread per GPU; each thread creates the context for its device.
    #pragma omp parallel num_threads(num_devices)
    {
        int dev_id = omp_get_thread_num();
        std::cout << "Switching to device: " << dev_id << std::endl;
        // Line the threads up so the contexts are created at the same time.
        #pragma omp barrier
        cudaSetDevice(dev_id);
        CUcontext ctx;
        cuDevicePrimaryCtxRetain(&ctx, dev_id); // or `cudaFree(0)` just to force the instantiation of the cuda context
        #pragma omp barrier
        std::cout << "Switched to device: " << dev_id << std::endl;
    }
    std::cout << "Ended multi context in parallel" << std::endl;
    return 0;
}
Build and run (the hang is intermittent, so it might require multiple runs):
$ nvcc -lcuda -lcudart -Xcompiler=-fopenmp -O3 -arch=sm_75 pmcs.cu -o pmcs
$ LD_PRELOAD=./tools/noop/noop.so ./pmcs # or any other tool within ./tools could be used
Note: noop.so is a tool that does nothing. Its instrumentation functions simply return when entered.
I tried to find where the hang occurs. The furthest I was able to trace it is a cudaDeviceSynchronize call in ./core/nvbit_tool.h +82, but I think that is a symptom rather than the cause:
...
extern "C" void nvbit_at_context_init_hook() {
    __nvbit_start();
    load_module_nvbit_kernel<<<1, 1>>>(0);
    cudaDeviceSynchronize(); // this line
    assert(cudaGetLastError() == cudaSuccess);
}
If this problem is known, are there any known workarounds for it?