NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

RuntimeError: Cuda error: 700[cudaStreamSynchronize(stream);] when calling `peeler.rasterize_next_layer()` #197

Open · f1shel opened this issue 1 month ago

f1shel commented 1 month ago

Hi, I've encountered a strange bug when calling peeler.rasterize_next_layer(). The code is part of a training script running in a multi-GPU server environment. Initially everything worked fine, but after about 3 hours of training the error suddenly appeared. I looked into similar issues, and some suggest the problem might be related to the progressively growing internal buffers.

I added dr.set_log_level(0) to my code and observed that the internal buffer size gradually increased from 500 MB to 1700 MB (without triggering a CUDA error yet). I don't think it's a GPU memory issue: the network itself uses around 60 GB of memory, leaving up to 20 GB available for nvdiffrast on an 80 GB H100.

I also doubt it's related to invalid data; I tried some test cases in a notebook, such as zero-length vertex buffers and data containing nan or inf, but none of them triggered the error. I'm really puzzled as to what could be causing this and would appreciate any insights. Thanks in advance!
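
For reference, my rendering code follows the usual depth-peeling pattern; here is a simplified sketch (pos, tri, the resolution, and num_layers stand in for my actual inputs):

```python
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()  # created once and reused for every call

# pos: [minibatch, num_vertices, 4] clip-space positions
# tri: [num_triangles, 3] int32 triangle indices
with dr.DepthPeeler(glctx, pos, tri, resolution=[512, 512]) as peeler:
    for _ in range(num_layers):
        # This is the call that eventually fails with Cuda error 700.
        rast, rast_db = peeler.rasterize_next_layer()
```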

Following is the full log:

```
2024-08-15T05:19:51.942Z [5]: out = self.lrm_generator.forward_geometry(
2024-08-15T05:19:51.942Z [5]: File "/workspace/src/models/lrm_geo.py", line 338, in forward_geometry
2024-08-15T05:19:51.942Z [5]: antilias_mask, hard_mask, tex_pos, depth, normal = self.render_mesh(mesh_v, mesh_f, cam_mv_and_intrinsics, render_size=render_size)
2024-08-15T05:19:51.942Z [5]: File "/workspace/src/models/lrm_geo.py", line 297, in render_mesh
2024-08-15T05:19:51.942Z [5]: return_value = self.geometry.render_mesh(
2024-08-15T05:19:51.942Z [5]: File "/workspace/src/models/geometry/rep_3d/diso_geometry.py", line 67, in render_mesh
2024-08-15T05:19:51.942Z [5]: tex_pos, mask, hard_mask, rast, v_pos_clip, mask_pyramid, depth, normal = self.renderer.render_mesh(
2024-08-15T05:19:51.942Z [5]: File "/workspace/src/models/geometry/render/neural_render.py", line 134, in render_mesh
2024-08-15T05:19:51.942Z [5]: rast, db = peeler.rasterize_next_layer()
2024-08-15T05:19:51.942Z [5]: File "/opt/conda/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 378, in rasterize_next_layer
2024-08-15T05:19:51.942Z [5]: result = _rasterize_func.apply(self.raster_ctx, self.pos, self.tri, self.resolution, self.ranges, self.grad_db, self.peeling_idx)
2024-08-15T05:19:51.942Z [5]: File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
2024-08-15T05:19:51.942Z [5]: return super().apply(*args, **kwargs) # type: ignore[misc]
2024-08-15T05:19:51.942Z [5]: File "/opt/conda/lib/python3.10/site-packages/nvdiffrast/torch/ops.py", line 248, in forward
2024-08-15T05:19:51.942Z [5]: out, out_db = _get_plugin().rasterize_fwd_cuda(raster_ctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
2024-08-15T05:19:51.942Z [5]: RuntimeError: Cuda error: 700[cudaStreamSynchronize(stream);]
2024-08-15T05:19:52.743Z [5]: [W CUDAGuardImpl.h:115] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
2024-08-15T05:19:52.743Z [5]: terminate called after throwing an instance of 'c10::Error'
2024-08-15T05:19:52.743Z [5]: what(): CUDA error: an illegal memory access was encountered
2024-08-15T05:19:52.743Z [5]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-08-15T05:19:52.743Z [5]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-08-15T05:19:52.743Z [5]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-08-15T05:19:52.743Z [5]:
2024-08-15T05:19:52.743Z [5]: Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1696595231861/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
2024-08-15T05:19:52.743Z [5]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f45ae2f7617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-08-15T05:19:52.743Z [5]: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f45ae2b298d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-08-15T05:19:52.743Z [5]: frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f45ae3b39f8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
2024-08-15T05:19:52.743Z [5]: frame #3: <unknown function> + 0x16746 (0x7f45ae37c746 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
2024-08-15T05:19:52.743Z [5]: frame #4: <unknown function> + 0x1947d (0x7f45ae37f47d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
2024-08-15T05:19:52.743Z [5]: frame #5: <unknown function> + 0x1989d (0x7f45ae37f89d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
2024-08-15T05:19:52.743Z [5]: frame #6: <unknown function> + 0x510c06 (0x7f45f541ec06 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #7: <unknown function> + 0x55ca7 (0x7f45ae2dcca7 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-08-15T05:19:52.743Z [5]: frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7f45ae2d4cb3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-08-15T05:19:52.743Z [5]: frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f45ae2d4e49 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2024-08-15T05:19:52.743Z [5]: frame #10: c10d::Reducer::~Reducer() + 0x5b7 (0x7f45ee205b27 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2024-08-15T05:19:52.743Z [5]: frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f45f5b1d1c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f45f52f0808 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #13: <unknown function> + 0xc11cb1 (0x7f45f5b1fcb1 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #14: <unknown function> + 0x3ec37b (0x7f45f52fa37b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #15: <unknown function> + 0x3ed301 (0x7f45f52fb301 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
2024-08-15T05:19:52.743Z [5]: frame #16: <unknown function> + 0x131156 (0x5604894da156 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #17: <unknown function> + 0x153b80 (0x5604894fcb80 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #18: <unknown function> + 0x1d67b1 (0x56048957f7b1 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #19: <unknown function> + 0x1fbcb6 (0x5604895a4cb6 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #20: <unknown function> + 0x127692 (0x5604894d0692 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #21: <unknown function> + 0x2092fc (0x5604895b22fc in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #22: Py_FinalizeEx + 0x146 (0x5604895b0c36 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #23: Py_RunMain + 0x106 (0x5604895a3826 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #24: Py_BytesMain + 0x37 (0x560489574527 in /opt/conda/bin/python)
2024-08-15T05:19:52.743Z [5]: frame #25: __libc_start_main + 0xf3 (0x7f4644c75083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-08-15T05:19:52.743Z [5]: frame #26: <unknown function> + 0x1cb421 (0x560489574421 in /opt/conda/bin/python)
```
f1shel commented 1 month ago

After further debugging, I identified the faulty input data that caused the CUDA error. Specifically, assuming mesh = network(input), I captured both the input that triggered the error and the network checkpoint saved closest to it. Upon investigation, I found that the mesh had an extremely large number of vertices and faces: 5 million vertices and 10 million faces. When debugging externally, I observed that nvdiffrast reported generating a 4 GB buffer. Therefore, I suspect the issue might indeed be related to GPU memory. Could you suggest any strategies for handling scenarios where the vertex and face counts are exceptionally high?

f1shel commented 1 month ago

I created a synthetic merged mesh in a notebook by combining the 5 meshes mentioned earlier. When I attempted to rasterize this merged mesh with nvdiffrast, I successfully reproduced the CUDA error. This confirms that the issue is indeed caused by the excessively large mesh leading to GPU memory problems. Under normal training conditions the network wouldn't generate such excessively large meshes, so this is likely a bug on my side, and I need to focus more on the network. Still, I would be glad to see nvdiffrast handle extreme cases like this more gracefully (e.g., is there an example of allocating a fixed-size buffer at the beginning?). Anyway, thank you!

s-laine commented 1 month ago

Which of the buffers is the problem? The triangle/vertex buffers are reallocated to accommodate the incoming data if they're not large enough, so their size should always reflect the largest input seen thus far. The frame buffer is a bit different, as it's resized to accommodate the maximum over each dimension (width, height, minibatch) separately.

The OpenGL/Cuda interop seems to run into problems when allocating and freeing buffers multiple times, leading to gradual accumulation of resource usage — not necessarily GPU memory per se — and an eventual crash. Presumably it is running out of some sort of driver-internal resource that isn't freed up until the process is terminated, so there isn't a lot that can be done on the application side except avoiding reallocations.

To preallocate a buffer, all you need is to call the rasterizer once with the largest input you expect to encounter. The buffer sizes are never reduced, so this should remove the need to expand them later on. The buffers are local to the RasterizeGLContext, so make sure you're reusing the same context in every call to rasterize() or DepthPeeler() instead of creating a new one every time.
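
For example, something along these lines at startup (a sketch; max_pos and max_tri are placeholders for the largest clip-space position and triangle buffers you expect to encounter):

```python
import nvdiffrast.torch as dr

# Create the context once and reuse it for every subsequent rasterization;
# its internal buffers are never shrunk once they have grown.
glctx = dr.RasterizeGLContext()

# Warm-up call with the largest expected input: max_pos is a
# [minibatch, num_vertices, 4] clip-space position tensor and max_tri a
# [num_triangles, 3] int32 triangle tensor, rasterized at the largest
# resolution you will render. Later calls that stay within these sizes
# should not trigger further reallocation.
rast, _ = dr.rasterize(glctx, max_pos, max_tri, resolution=[1024, 1024])
```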

I would also suggest trying out the Cuda-based rasterizer (replace RasterizeGLContext with RasterizeCudaContext) if possible. It doesn't use OpenGL and thus won't run into this issue.

f1shel commented 1 month ago

Thank you! It seems that I can now initialize a larger preallocated buffer. I have already been using RasterizeCudaContext, which I initialize in the __init__() method of my render class and reuse for each rendering call. As for the buffers in question, I don't have much detail about their types, aside from the log message [I RasterImpl.cpp:173] Internal buffers grown to X MB. However, as the rendering resolution has not changed, I guess they are vertex buffers.

s-laine commented 3 weeks ago

Ah, sorry, I didn't realize you were already using the Cuda rasterizer. Its memory usage is quite complicated and hard to predict, as it depends on how the triangles overlap with tiles and pixels on screen, how they clip against the view frustum, and so on. The code detects cases where the internal buffers aren't large enough and resizes them automatically before retrying the operation in question, which is also what outputs the "Internal buffers grown" message.

I'm guessing the large input leads to some internal indexing arithmetic overflowing, which could easily cause illegal memory accesses and Cuda error 700. The code wasn't designed to tolerate or even detect that situation, so in that sense this is a genuine bug/limitation, and for now the only workaround is to reduce the size of the input.

That said, a simple mesh, say a tessellated sphere, with 5–10 million vertices shouldn't require much internal buffer space, because each triangle would rasterize into only a few pixel tiles. To get excessive memory usage, you'd need many triangles to overlap a large screen area. If this is as intended, you could try rendering the image in smaller pieces to reduce memory usage. If the mesh shouldn't be like that, there might be a bug in how it's constructed.
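
Rendering in pieces isn't a built-in feature, but it can be approximated by scaling and offsetting the clip-space x/y coordinates so that each tile of the image fills its own small viewport. Below is a rough sketch (the helper name, the tile layout, and the assumption that the resolution divides evenly are illustrative; effects that look across tile borders, such as antialiasing, may not be seamless):

```python
import torch
import nvdiffrast.torch as dr

def rasterize_tiled(glctx, pos_clip, tri, height, width, n=2):
    # Rasterize the full height x width image as an n x n grid of tiles so
    # that each rasterize() call covers a smaller viewport and thus needs
    # smaller internal buffers. Assumes height and width are divisible by n.
    # pos_clip: [minibatch, num_vertices, 4] clip-space positions.
    th, tw = height // n, width // n
    rows = []
    for ty in range(n):
        cols = []
        for tx in range(n):
            # Scale and offset x/y in clip space so the selected tile maps
            # onto the full [-1, 1] NDC range of the small viewport.
            p = pos_clip.clone()
            p[..., 0] = n * pos_clip[..., 0] - (2 * tx + 1 - n) * pos_clip[..., 3]
            p[..., 1] = n * pos_clip[..., 1] - (2 * ty + 1 - n) * pos_clip[..., 3]
            rast, _ = dr.rasterize(glctx, p, tri, resolution=[th, tw])
            cols.append(rast)
        rows.append(torch.cat(cols, dim=2))   # stitch tiles along width
    return torch.cat(rows, dim=1)             # stitch rows along height
```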

whiteinblue commented 1 week ago

Hi, I encountered the same CUDA out-of-memory (OOM) error in my project as well. In my code, I render a large mesh three times, and each pass generates numerous images without any backward operations. The error would occur for unknown reasons. While monitoring GPU memory usage during execution, I noticed that it continuously increased. To address this, I added CUDA memory-releasing code and manually deleted the nvdiffrast object after each rendering pass. After these changes, my code runs successfully with a larger vertex count (600M), but it still fails when the vertex count exceeds 900M.

My code:

```python
del render_obj
torch.cuda.empty_cache()
```

Or:

```python
del render_obj
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
torch.cuda.synchronize()
```
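
In context, the structure is roughly as follows (a sketch; make_renderer and render_pass are placeholders for my own renderer setup and rendering call):

```python
import torch

for pass_idx in range(3):
    render_obj = make_renderer()       # placeholder: builds the nvdiffrast context and renderer
    images = render_pass(render_obj)   # placeholder: generates many images, no backward pass

    # Drop the renderer (and the nvdiffrast context it holds) and release
    # cached CUDA memory before starting the next pass.
    del render_obj
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    torch.cuda.synchronize()
```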