NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

CUDA Runtime error when calling `rasterize` #72

Closed lioutasb closed 2 years ago

lioutasb commented 2 years ago

I recently switched from Pytorch3D to nvdiffrast for rendering raster images based on the outputs of a deep network. The library works great and is substantially faster than Pytorch3D. I use Singularity containers to containerize my dependencies, and the library installs without any issues. When I train my models locally on my laptop, everything works as expected. My university provides a cluster (managed with SLURM) on which I noticed that some of my training jobs occasionally fail after a random number of training steps with the following error:

File "/usr/local/lib/python3.8/dist-packages/nvdiffrast/torch/ops.py", line 246, in rasterize 
       return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1) 
File "/usr/local/lib/python3.8/dist-packages/nvdiffrast/torch/ops.py", line 184, in forward 
      out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 801[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]

I was wondering if you have some insight into why this may be happening. I never had issues with Pytorch3D, but nvdiffrast uses OpenGL, so I suspect it has something to do with that.

s-laine commented 2 years ago

I haven't seen this problem before. Given that everything works on your laptop, my first hunch is that the cluster environment might have outdated GPU drivers. You may also want to check whether the cluster has older versions of the CUDA toolkit or helper libraries such as cuDNN than your laptop, although if everything worked with Pytorch3D, those are probably not the problem.

Running out of GPU memory is a possible cause. The failing function call happens at a point where you are rasterizing an image that is larger than previous outputs, either in width, height, or depth (minibatch size), and nvdiffrast has to allocate a new output buffer. If your rasterization call requires an enormous output buffer, it could cause an error like this. To see what is going on, you can add dr.set_log_level(0) at the beginning of your program to print out details of these reallocations (among other things) as they occur. If you allocate slightly larger buffers often enough, I guess the memory could become fragmented. To remedy this, you could enforce a maximum size and do one rasterization call that requires it at the beginning; after that, no reallocations will be done.
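A minimal sketch of that warm-up, assuming the maximum sizes are known up front (the bounds and the dummy triangle below are placeholders, not part of nvdiffrast):

```python
import torch
import nvdiffrast.torch as dr

dr.set_log_level(0)              # log buffer reallocations, among other details
glctx = dr.RasterizeGLContext()  # create once, reuse for the whole run

# Assumed upper bounds for this training run (placeholders).
MAX_BATCH, MAX_H, MAX_W = 16, 512, 512

# One dummy triangle, replicated to the largest minibatch size. Rasterizing it
# once at the maximum resolution forces nvdiffrast to allocate its largest
# output buffer up front, so later, smaller calls never trigger a reallocation.
pos = torch.tensor([[[-0.5, -0.5, 0.0, 1.0],
                     [ 0.5, -0.5, 0.0, 1.0],
                     [ 0.0,  0.5, 0.0, 1.0]]], device='cuda').repeat(MAX_BATCH, 1, 1)
tri = torch.tensor([[0, 1, 2]], dtype=torch.int32, device='cuda')
rast, _ = dr.rasterize(glctx, pos, tri, resolution=[MAX_H, MAX_W])
```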

One thing that comes to mind: Are you sure you're not creating a new RasterizeGLContext all the time? You should create just one and keep using it throughout the program. Otherwise your performance will be extremely low, and I wouldn't be surprised if it just stopped working at some point.
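The intended pattern is to construct the context once, for example in the renderer's `__init__`, and pass the same object to every rasterization call. A rough sketch (the `Renderer` wrapper here is hypothetical):

```python
import nvdiffrast.torch as dr

class Renderer:
    def __init__(self):
        # One OpenGL context for the lifetime of the program.
        self.glctx = dr.RasterizeGLContext()

    def render(self, pos, tri, resolution):
        # Reuse the stored context; do not construct a new RasterizeGLContext here.
        rast, rast_db = dr.rasterize(self.glctx, pos, tri, resolution=resolution)
        return rast, rast_db
```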

Finally, as Cuda launches are asynchronous, I've seen quite a few times that errors elsewhere in the program (e.g., invalid memory accesses in Cuda kernels) are only caught at these Cuda/OpenGL interop calls. To debug these, you can set the environment variable CUDA_LAUNCH_BLOCKING=1, which enforces synchronization after every Cuda kernel launch. If there indeed is a bug somewhere else in the program, this makes it easier to pinpoint the buggy kernel. Forcing blocking launches decreases performance a lot, so it's not something you want to keep on all the time. But again, if everything worked with Pytorch3D, and nvdiffrast works on your laptop, there would need to be something that makes nvdiffrast bug out in the cluster environment, and I cannot think of anything except GPU drivers.
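Setting the variable in the SLURM job script or on the command line works; if you set it from Python instead, it has to happen before anything initializes CUDA, roughly like this (a generic sketch, not nvdiffrast-specific):

```python
import os

# Force synchronous kernel launches so errors are reported at the kernel that
# caused them. Must be set before CUDA is initialized, i.e. before importing
# torch or at least before the first CUDA operation.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch                 # imported after the variable is set, on purpose
import nvdiffrast.torch as dr
```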

Hope this helps you forward. Let me know if you find out more clues, I'll be happy to help you hunt this down.

lioutasb commented 2 years ago

> Running out of GPU memory is a possible cause. [...] To remedy this, you could enforce a maximum size and do one rasterization call that requires it at the beginning; after that, no reallocations will be done.

That was the culprit! With your help, I discovered that every time I initialized my renderer object (I am reusing the same RasterizeGLContext), I was accidentally rendering with a much larger mini-batch size, driven by a dynamic value that shouldn't be used during training. This led to frequent allocations of a new output buffer.

MHassan1122 commented 8 months ago

RuntimeError: Cuda error: 304[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]