f1shel opened 1 month ago
After further debugging, I identified the faulty input data that caused the CUDA error. Specifically, assuming `mesh = network(input)`, I captured both the input that triggered the error and the network checkpoint saved closest to it. On investigation, I found that the mesh had an extremely large number of vertices and faces: 5 million vertices and 10 million faces. When debugging externally, I observed that nvdiffrast reported allocating a 4 GB buffer, so I suspect the issue is indeed related to GPU memory. Could you suggest any strategies for handling scenarios where the vertex and face counts are exceptionally high?
I created a synthetic merged mesh in a notebook by combining the 5 meshes mentioned earlier. When I attempted to rasterize this merged mesh with nvdiffrast, I successfully reproduced the CUDA error. This confirms that the issue is indeed caused by the excessively large mesh, leading to GPU memory problems. Under normal training conditions such excessively large meshes wouldn't be generated, so this is likely a bug in my network, and I need to focus more on the network side. Still, I would be glad to see nvdiffrast handle extreme cases like this more gracefully (e.g., is there an example of allocating a fixed-size buffer at the beginning?). Anyway, thank you!
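For reference, a merged mesh like the one described can be built by concatenating the vertex arrays and offsetting each face array by the number of vertices that precede it. A minimal PyTorch sketch (the `merge_meshes` helper name is my own, not part of nvdiffrast):

```python
import torch

def merge_meshes(meshes):
    """Merge a list of (verts, faces) pairs into one big mesh.

    verts: [V, 3] float tensors; faces: [F, 3] int32 index tensors.
    Each mesh's face indices are offset by the total vertex count of
    the meshes that come before it, so indices stay valid after merging.
    """
    verts_out, faces_out, offset = [], [], 0
    for verts, faces in meshes:
        verts_out.append(verts)
        faces_out.append(faces + offset)  # shift indices past earlier vertices
        offset += verts.shape[0]
    return torch.cat(verts_out, dim=0), torch.cat(faces_out, dim=0)
```

The merged `(verts, faces)` pair can then be fed to the rasterizer like any single mesh.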
Which of the buffers is the problem? The triangle/vertex buffers are reallocated to accommodate the incoming data if they're not large enough, so their size should always reflect the largest input seen thus far. The frame buffer is a bit different, as it's resized to accommodate the maximum over each dimension (width, height, minibatch) separately.
The OpenGL/Cuda interop seems to run into problems when allocating and freeing buffers multiple times, leading to gradual accumulation of resource usage — not necessarily GPU memory per se — and an eventual crash. Presumably it is running out of some sort of driver-internal resource that isn't freed up until the process is terminated, so there isn't a lot that can be done on the application side except avoiding reallocations.
To preallocate a buffer, all you need to do is call the rasterizer once with the largest input you expect to encounter. The buffer sizes are never reduced, so this should remove the need to expand them later on. The buffers are local to the `RasterizeGLContext`, so make sure you're reusing the same context in every call to `rasterize()` or `DepthPeeler()` instead of creating a new one every time.
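A minimal sketch of this preallocation pattern, assuming nvdiffrast's Torch API; the sizes, the resolution, and the `make_warmup_inputs` helper are illustrative placeholders, not values from the thread:

```python
import torch

def make_warmup_inputs(max_verts, max_tris, device="cpu"):
    """Dummy clip-space positions and triangles sized to the largest mesh
    you expect; rasterizing them once grows the context buffers up front."""
    pos = torch.zeros(1, max_verts, 4, device=device)
    pos[..., 3] = 1.0  # w = 1 keeps the dummy vertices valid in clip space
    tri = torch.zeros(max_tris, 3, dtype=torch.int32, device=device)
    return pos, tri

if torch.cuda.is_available():
    import nvdiffrast.torch as dr

    # Create the context once and reuse it for every call; buffers never shrink.
    glctx = dr.RasterizeGLContext()

    # Warm-up pass at the largest expected size, so buffers are grown exactly once.
    pos, tri = make_warmup_inputs(5_000_000, 10_000_000, device="cuda")
    dr.rasterize(glctx, pos, tri, resolution=[1024, 1024])

    # Later calls with smaller real inputs reuse the already-grown buffers:
    # rast, rast_db = dr.rasterize(glctx, real_pos, real_tri, resolution=[1024, 1024])
```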
I would also suggest trying out the Cuda-based rasterizer (replace `RasterizeGLContext` with `RasterizeCudaContext`) if possible. It doesn't use OpenGL and thus won't run into this issue.
Thank you! It seems that I can now initialize a larger preallocated buffer. I have already been using `RasterizeCudaContext`, which I initialize in the `__init__()` method of my render class and reuse for each rendering call. As for the buffers in question, I don't have much detail about their types, aside from a log message: `[I RasterImpl.cpp:173] Internal buffers grown to X MB`. However, since the rendering resolution has not changed, I guess they are vertex buffers.
Ah, sorry, I didn't realize you were already using the Cuda rasterizer. Its memory usage is quite complicated and hard to predict, as it depends on how the triangles overlap with tiles and pixels on screen, how they clip against the view frustum, and so on. The code detects cases where the internal buffers aren't large enough and resizes them automatically before retrying the operation in question (in the function here), which also outputs the message about the buffer resize.
I'm guessing the large input leads to some internal indexing arithmetic overflowing, which could easily cause illegal memory accesses and Cuda error 700. The code wasn't designed to tolerate or even detect that situation, so in that sense this is a genuine bug/limitation, and for now the only workaround is to reduce the size of the input.
That said, a simple mesh, say a tessellated sphere, with 5–10 million vertices shouldn't require much internal buffer space, because each triangle would rasterize into only a few pixel tiles. To get excessive memory usage, you'd need many triangles overlapping a large screen area. If this is intended, you could try rendering the image in smaller pieces to reduce memory usage. If the mesh shouldn't be like that, there might be a bug in how it's constructed.
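Rendering in smaller pieces can be done by remapping clip-space coordinates so that each tile fills the full NDC range, then rasterizing each tile at a reduced resolution. A sketch of the tile transform (the helper is my own, not part of nvdiffrast, and the y-axis sign may need flipping depending on your conventions):

```python
def tile_clip_transform(i, j, cols, rows):
    """Scale/offset that makes tile (i, j) of a cols x rows grid fill
    the whole [-1, 1] NDC range.

    Returns (sx, tx, sy, ty) to be applied in clip space (before the
    perspective divide) as x' = sx*x + tx*w and y' = sy*y + ty*w.
    """
    sx, tx = cols, cols - 1 - 2 * i
    sy, ty = rows, rows - 1 - 2 * j
    return sx, tx, sy, ty

# Hypothetical usage with nvdiffrast (not executed here): for each tile,
# transform the clip-space positions and rasterize at the tile resolution:
#   sx, tx, sy, ty = tile_clip_transform(i, j, 2, 2)
#   tiled = pos.clone()
#   tiled[..., 0] = sx * pos[..., 0] + tx * pos[..., 3]
#   tiled[..., 1] = sy * pos[..., 1] + ty * pos[..., 3]
#   rast, _ = dr.rasterize(glctx, tiled, tri, resolution=[H // 2, W // 2])
```

Each tile only rasterizes the triangles that fall inside it, so peak internal buffer usage drops roughly in proportion to the tile area.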
Hi, I encountered the same CUDA out-of-memory (OOM) error in my project. In my code, I render a large mesh three times, and each pass generates numerous images without any backward operations. The error would occur for unknown reasons. While monitoring GPU memory usage during execution, I noticed it continuously increased. To address this, I added CUDA memory-releasing code and manually deleted the nvdiffrast object after each rendering pass. After these changes, my code runs successfully for larger vertex counts (600M), but still fails when the vertex count exceeds 900M.
My code:

```python
del render_obj
torch.cuda.empty_cache()
```
Or:

```python
del render_obj
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
torch.cuda.synchronize()
```
Hi, I've encountered a strange bug when calling `peeler.rasterize_next_layer()`. The code, which is part of a training script, runs in a multi-GPU server environment. Initially everything was working fine, but as training progressed (around 3 hours in), the error suddenly appeared. I looked into similar issues, and some suggest the problem might be related to the progressively growing internal buffers. I added `dr.set_log_level(0)` to my code and observed that the internal buffer size gradually increased from 500 MB to 1700 MB (without triggering a CUDA error yet). I don't think it's a GPU memory issue, as the network itself uses around 60 GB of memory, leaving up to 20 GB available for nvdiffrast on an 80 GB H100. I also doubt it's related to invalid data, as I tried some test cases in a notebook, like zero-length vertices and data containing NaN or Inf, but none of these caused the error. I'm really puzzled as to what could be causing this issue and would appreciate any insights. Thanks in advance!
The following is the full log: