The first invocation of rasterize() involves some setup, such as creating the vertex and index buffers and registering them on the CUDA side. If you're using TensorFlow, the first invocation also creates the OpenGL context, which is a fairly heavy operation; in Torch this cost is paid when you create the RasterizeGLContext object. All of these steps are necessary for nvdiffrast to function, so there's no simple way to reduce the startup cost. If this interferes with your timing, you can do a dummy rasterize() operation before entering the main timing loop.
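A minimal warm-up sketch, assuming the Torch API used later in this thread (the triangle, resolution, and device below are arbitrary placeholders, not anything nvdiffrast requires):

import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()   # in Torch, context creation is the heavy one-time step
device = 'cuda:0'

# Dummy rasterize() so the one-time setup cost is paid outside the timing loop.
pos = torch.tensor([[[-0.5, -0.5, 0.0, 1.0],
                     [ 0.5, -0.5, 0.0, 1.0],
                     [ 0.0,  0.5, 0.0, 1.0]]], dtype=torch.float32, device=device)
tri = torch.tensor([[0, 1, 2]], dtype=torch.int32, device=device)
dr.rasterize(glctx, pos, tri, resolution=[64, 64])
torch.cuda.synchronize()          # make sure the warm-up has actually finished before timing

# ... start the real timing loop here ...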
@s-laine thanks for your reply!!!
Hi @s-laine, I have another question. When I call the warpaffine function in a loop, the GPU memory usage keeps growing. How can I release it?
Looking forward to your reply again.
The code snippet looks good to me, so my guess is that you are accidentally retaining (possibly indirect) references to Torch tensors from previous iterations in your loop. This prevents the Python interpreter from deleting them, which in turn prevents Torch from deallocating the associated GPU buffers. See the first question in the PyTorch FAQ for more information.
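As a generic illustration (a sketch, not taken from the code in this thread), keeping the loss tensor itself around across iterations is a common way to retain references accidentally:

import torch

device = 'cuda:0'
losses = []
for it in range(1000):
    x = torch.randn([1024, 1024], device=device, requires_grad=True)
    loss = (x * x).mean()
    loss.backward()
    losses.append(loss)            # BAD: keeps each iteration's graph and GPU buffers alive
    # losses.append(loss.item())   # OK: stores a plain Python float instead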
I tried packaging the model as follows. When I call the warpaffine function in a loop, the GPU memory still keeps growing. Can you point out my mistake? Thank you! Hoping for your reply!
I don't see a problem in this code either, so I'm still assuming there's something wrong in the loop that calls this code. Perhaps you can try to list all Torch tensors in memory and see if there's more and more of them as the loop is repeated? There's a snippet here that shows how to do that, but I haven't tried it myself so I cannot tell how well this works.
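As an illustration, a rough version of that kind of check (my own sketch based on the gc-based approach from the PyTorch forums, not code from this thread) could look like this:

import gc
import torch

def count_live_cuda_tensors():
    # Walk all objects tracked by the garbage collector and tally CUDA tensors.
    n, total_bytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                n += 1
                total_bytes += obj.element_size() * obj.nelement()
        except Exception:
            pass  # some tracked objects raise on attribute access; skip them
    print('live CUDA tensors: %d, approx. %.1f MB' % (n, total_bytes / 2**20))

Calling count_live_cuda_tensors() once per loop iteration should show whether the number of live tensors grows without bound.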
Are you by any chance using nn.DataParallel in your program? There's a separate issue about an unexplained memory leak there that I unfortunately cannot debug myself at this time.
We have done training runs taking multiple hours and performing millions of nvdiffrast calls, without any memory leaks. Thus there's probably either a problem with your code, or you're using nvdiffrast in a way that triggers a previously unseen bug. Based on the code snippet above I cannot tell which one is the case.
@s-laine I find that if I fix the input shape, the GPU memory usage stays at a constant level. When the input shape follows the actual size of each input, the memory keeps growing. Do I have to keep the input shape fixed?
There's no need to keep the input size fixed.
I just tried running 100k iterations of forward and backward rasterization+interpolation+texture with randomized resolutions for both viewport and texture, and randomized triangle counts, and I don't see any signs of a memory leak. I don't know what could explain the behavior you're seeing, but I still suspect you're retaining references to stale tensors somewhere in your code.
For reference, here's the code I used for testing.
import numpy as np
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()
device = 'cuda:0'

for it in range(100000):
    # Randomize triangle count, viewport resolution, and texture resolution every iteration.
    num_tri = np.random.randint(1, 2**16)
    res_x = np.random.randint(32, 1024)
    res_y = np.random.randint(32, 1024)
    tex_x = np.random.randint(32, 1024)
    tex_y = np.random.randint(32, 1024)

    # Random geometry, UVs, and texture; pad positions to homogeneous clip coordinates (w=1).
    pos = torch.randn([1, num_tri*3, 3], dtype=torch.float32, device=device, requires_grad=True)
    pos = torch.nn.functional.pad(pos, (0, 1), value=1.0)
    uv = torch.rand(size=[1, num_tri*3, 2], dtype=torch.float32, device=device, requires_grad=True)
    tex = torch.rand(size=[1, tex_y, tex_x, 3], dtype=torch.float32, device=device, requires_grad=True)
    tri = torch.arange(0, num_tri*3, dtype=torch.int32, device=device).reshape([num_tri, 3])

    # Forward: rasterize, interpolate UVs, sample texture. Backward: propagate a scalar loss.
    rast_out, _ = dr.rasterize(glctx, pos, tri, resolution=[res_y, res_x])
    uv_out, _ = dr.interpolate(uv, rast_out, tri)
    tex_out = dr.texture(tex, uv_out, filter_mode='linear')
    loss = torch.mean(tex_out)
    loss.backward()

    if not it % 1000:
        print('iter: %d' % it)
Thanks for your code. I have a question: why does the first run take so long? I'm running the code in the Docker container you provide, and I've only implemented a function similar to warpaffine. The running times are as follows.
Can I avoid this cold-start cost?
Looking forward to your reply.