Closed by mahmoodn 1 year ago
The opcode_hist tool injects a cudaDeviceSynchronize() at each kernel invocation so that it can print the histogram for every kernel execution.
The application you are using performs stream capture (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#creating-a-graph-using-stream-capture), which forbids cudaDeviceSynchronize() inside a capture region; that is what triggers the error.
One solution could be to modify opcode_hist so that it skips cudaDeviceSynchronize() in those regions.
The code says:
/* if we are exiting a kernel launch:
* 1. Wait until the kernel is completed using
* cudaDeviceSynchronize()
* 2. Get number of thread blocks in the kernel
* 3. Print the thread instruction counters
* 4. Release the lock*/
CUDA_SAFECALL(cudaDeviceSynchronize());
If I remove cudaDeviceSynchronize(), I think the stats become unreliable, because the tool no longer waits for the kernel to finish before it reads the number of blocks and the instruction counters. Is that right?
I am not able to use NVBit with RNNT from MLPerf 2.0. Please see the output below:
Any idea what to do? I followed the guide from here to build the benchmarks and used the Docker version. The device is an RTX 3080 and, according to nvidia-smi, the driver on the host (outside of Docker) reports:
NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6