NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Potential memory leak with rasterize() #30

Closed. bathal1 closed this issue 3 years ago.

bathal1 commented 3 years ago

Hello,

When running several optimizations in a script, I noticed that my GPU eventually runs out of memory, causing the script to fail. Looking at nvidia-smi after each optimization run, it seems that some memory is never freed (except when the process is killed, of course).

Here is a minimal reproducer:

import nvdiffrast.torch as dr
import torch

def render_dummy():
    glctx = dr.RasterizeGLContext()
    # Create the NDCs of one dummy triangle seen from 16 dummy viewpoints
    v = torch.ones((16,3,4), device='cuda')
    f = torch.tensor([[0,1,2]], device='cuda', dtype=torch.int32)
    dr.rasterize(glctx, v, f, (1080, 1920))

Then, running

render_dummy()
torch.cuda.empty_cache()

several times in a Jupyter notebook and checking nvidia-smi between calls shows the memory used by the process growing.

Alternatively, running

for i in range(20):
    render_dummy()
    torch.cuda.empty_cache()

should be enough to make the GPU run out of memory (I have a Titan RTX on my end).
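
To watch the growth without leaving the notebook, something along these lines prints the device's used memory after each run (a sketch using pynvml, not part of the repro above; it assumes render_dummy() from the snippet at the top is defined):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; assumed to match the CUDA device used above

for i in range(20):
    render_dummy()
    torch.cuda.empty_cache()
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
    print(f"run {i}: {used_mib:.0f} MiB used")  # device-wide used memory; keeps growing while the leak is present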

The size of the leak seems to be proportional to the number of viewpoints and to the resolution, which makes me suspect that the framebuffer is not properly freed (provided I'm not to blame here 😅). For example, with the resolution and number of viewpoints in the example above, the leak on my end is about 1080 MiB, which is pretty close to the size of the result of rasterize.

Also, here's the log output from running the dummy rendering function once with dr.set_log_level(0):

[I glutil.cpp:322] Creating GL context for Cuda device 0
[I glutil.cpp:370] EGL 5.1 OpenGL context created (disp: 0x0000555af6bcdd70, ctx: 0x0000555af6cf7141)
[I rasterize.cpp:91] OpenGL version reported as 4.6
[I rasterize.cpp:332] Increasing position buffer size to 192 float32
[I rasterize.cpp:343] Increasing triangle buffer size to 64 int32
[I rasterize.cpp:368] Increasing frame buffer size to (width, height, depth) = (1920, 1088, 16)
[I rasterize.cpp:394] Increasing range array size to 64 elements
[I glutil.cpp:391] EGL OpenGL context destroyed (disp: 0x0000555af6bcdd70, ctx: 0x0000555af6cf7141)
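
For reference, a back-of-the-envelope estimate based on the frame buffer size reported above lands in the same ballpark as the observed growth (a sketch: it assumes 4 float32 channels per pixel and two color buffers of that size, one for the rasterizer output and one for its image-space derivatives):

width, height, depth = 1920, 1088, 16              # from the "Increasing frame buffer size" log line
bytes_per_buffer = width * height * depth * 4 * 4  # 4 channels x 4 bytes per float32
print(2 * bytes_per_buffer / 2**20)                # 1020.0 MiB for two buffers, close to the ~1080 MiB leak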

I initially noticed this behavior with nvdiffrast v0.2.0, but I have since updated to 0.2.5, which didn't change anything.

s-laine commented 3 years ago

Hi @bathal1! Thanks for the report and repro. I can confirm the issue on Windows: there is indeed a memory leak when destroying the GL context, because the CUDA-mapped graphics resources are apparently not freed along with it.

Note that creating a GL context is a slow operation, so you should not create one at every call to rasterize(); create it once and keep reusing the same context. However, if your use case requires frequent context creation and destruction, please try the following patch until we make a new release:

In nvdiffrast/torch/torch_rasterize.cpp, add the following lines:

  RasterizeGLStateWrapper::~RasterizeGLStateWrapper(void)
  {
+     setGLContext(pState->glctx);
+     rasterizeReleaseBuffers(NVDR_CTX_PARAMS, *pState);
+     releaseGLContext();
      destroyGLContext(pState->glctx);
      delete pState;
  }

In nvdiffrast/common/rasterize.h, add the following line:

  void rasterizeInitGLContext(NVDR_CTX_ARGS, RasterizeGLState& s, int cudaDeviceIdx);
  void rasterizeResizeBuffers(NVDR_CTX_ARGS, RasterizeGLState& s, int posCount, int triCount, int width, int height, int depth);
  void rasterizeRender(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, const float* posPtr, int posCount, int vtxPerInstance, const int32_t* triPtr, int triCount, const int32_t* rangesPtr, int width, int height, int depth, int peeling_idx);
  void rasterizeCopyResults(NVDR_CTX_ARGS, RasterizeGLState& s, cudaStream_t stream, float** outputPtr, int width, int height, int depth);
+ void rasterizeReleaseBuffers(NVDR_CTX_ARGS, RasterizeGLState& s);

Finally, in nvdiffrast/common/rasterize.cpp, add the following function:

void rasterizeReleaseBuffers(NVDR_CTX_ARGS, RasterizeGLState& s)
{
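    // Unregister every CUDA graphics resource that was mapped to a GL buffer,
    // so that destroying the GL context afterwards actually frees the GPU memory.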
    int num_outputs = s.enableDB ? 2 : 1;

    if (s.cudaPosBuffer)
    {
        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaPosBuffer));
        s.cudaPosBuffer = 0;
    }

    if (s.cudaTriBuffer)
    {
        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaTriBuffer));
        s.cudaTriBuffer = 0;
    }

    for (int i=0; i < num_outputs; i++)
    {
        if (s.cudaColorBuffer[i])
        {
            NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaColorBuffer[i]));
            s.cudaColorBuffer[i] = 0;
        }
    }

    if (s.cudaPrevOutBuffer)
    {
        NVDR_CHECK_CUDA_ERROR(cudaGraphicsUnregisterResource(s.cudaPrevOutBuffer));
        s.cudaPrevOutBuffer = 0;
    }
}

On my computer this leads to GPU memory usage remaining fixed over iterations of render_dummy() as expected. Please let me know if you still experience problems — I have not tested this on Linux.
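
For reference, with the patch applied, a context can be released explicitly from Python once it is no longer needed (a minimal sketch):

import nvdiffrast.torch as dr
import torch

glctx = dr.RasterizeGLContext()
# ... rasterize as usual ...
del glctx                 # dropping the last reference runs the destructor above, which unregisters the CUDA-mapped buffers
torch.cuda.empty_cache()  # also return any allocations cached by PyTorch to the driver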

bathal1 commented 3 years ago

Hi @s-laine, thanks for your quick reply! I just tried the fix on my end (Ubuntu 20.04) and it works like a charm, thanks a lot!

Just to clarify: I do not create the GL context at every rendering call in my actual code; it is only created once per call of my "main" optimization function. The problem arises when I call this function several times in a row, e.g. when doing a parameter search.
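
For completeness, a sketch of the pattern suggested above, creating the context once and reusing it across runs (optimize_scene and its learning_rate argument are placeholders for my actual code):

import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()            # created once, outside the optimization function

def optimize_scene(glctx, learning_rate):  # placeholder optimization entry point
    # ... set up the scene and pass glctx to every dr.rasterize() call ...
    pass

for lr in (1e-2, 1e-3, 1e-4):              # e.g. a small parameter search
    optimize_scene(glctx, lr)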

s-laine commented 3 years ago

Great to hear that this solved the problem for you. I'll keep this issue open until we have released a version that includes the fix.

s-laine commented 3 years ago

Fix included in v0.2.6, closing.