The first invocation of rasterize() involves some setup, such as creating the vertex and index buffers and registering them on the CUDA side. If you're using TensorFlow, the first invocation also creates the OpenGL context, which is a fairly heavy operation; in Torch this cost is paid when you create the RasterizeGLContext object. All of these steps are necessary for nvdiffrast to function, so there's no simple way to reduce the startup cost. If this interferes with your timing, you can do a dummy rasterize() operation before entering the main timing loop.
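A minimal warm-up sketch, assuming the Torch API used later in this thread (the triangle, resolution, and device below are arbitrary placeholders, not anything nvdiffrast requires):

import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()   # in Torch, context creation is the heavy one-time step
device = 'cuda:0'

# Dummy rasterize() so the one-time setup cost is paid outside the timing loop.
pos = torch.tensor([[[-0.5, -0.5, 0.0, 1.0],
                     [ 0.5, -0.5, 0.0, 1.0],
                     [ 0.0,  0.5, 0.0, 1.0]]], dtype=torch.float32, device=device)
tri = torch.tensor([[0, 1, 2]], dtype=torch.int32, device=device)
dr.rasterize(glctx, pos, tri, resolution=[64, 64])
torch.cuda.synchronize()          # make sure the warm-up has actually finished before timing

# ... start the real timing loop here ...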
@s-laine thanks for your reply!!!
Hi @s-laine, I have another question. When I call the warpaffine function in a loop, the GPU memory usage keeps growing. How can I release it?
Looking forward to your reply again.
The code snippet looks good to me, so my guess is that you are accidentally retaining (possibly indirect) references to Torch tensors from previous iterations in your loop. This prevents the Python interpreter from deleting them, which in turn prevents Torch from deallocating the associated GPU buffers. See the first question in the PyTorch FAQ for more information.
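As a generic illustration (a sketch, not taken from the code in this thread), keeping the loss tensor itself around across iterations is a common way to retain references accidentally:

import torch

device = 'cuda:0'
losses = []
for it in range(1000):
    x = torch.randn([1024, 1024], device=device, requires_grad=True)
    loss = (x * x).mean()
    loss.backward()
    losses.append(loss)            # BAD: keeps each iteration's graph and GPU buffers alive
    # losses.append(loss.item())   # OK: stores a plain Python float instead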
I tried packaging the model as follows. When I call the warpaffine function in a loop, the GPU memory still keeps growing. Can you point out my mistake? Thank you! Hoping for your reply!
I don't see a problem in this code either, so I'm still assuming there's something wrong in the loop that calls this code. Perhaps you can try to list all Torch tensors in memory and see if there's more and more of them as the loop is repeated? There's a snippet here that shows how to do that, but I haven't tried it myself so I cannot tell how well this works.
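As an illustration, a rough version of that kind of check (my own sketch based on the gc-based approach from the PyTorch forums, not code from this thread) could look like this:

import gc
import torch

def count_live_cuda_tensors():
    # Walk all objects tracked by the garbage collector and tally CUDA tensors.
    n, total_bytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                n += 1
                total_bytes += obj.element_size() * obj.nelement()
        except Exception:
            pass  # some tracked objects raise on attribute access; skip them
    print('live CUDA tensors: %d, approx. %.1f MB' % (n, total_bytes / 2**20))

Calling count_live_cuda_tensors() once per loop iteration should show whether the number of live tensors grows without bound.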
Are you by any chance using nn.DataParallel in your program? There's a separate issue about an unexplained memory leak there that I unfortunately cannot debug myself at this time.
We have done training runs taking multiple hours and performing millions of nvdiffrast calls, without any memory leaks. Thus there's probably either a problem with your code, or you're using nvdiffrast in a way that triggers a previously unseen bug. Based on the code snippet above I cannot tell which one is the case.
@s-laine I find that if I fix the input shape, the GPU memory usage stays at a constant level. When the input shape follows the actual size of each input, the memory keeps growing. Do I have to keep the input shape fixed?
There's no need to keep the input size fixed.
I just tried running 100k iterations of forward and backward rasterization+interpolation+texture with randomized resolutions for both viewport and texture, and randomized triangle counts, and I don't see any signs of a memory leak. I don't know what could explain the behavior you're seeing, but I still suspect you're retaining references to stale tensors somewhere in your code.
For reference, here's the code I used for testing.
import numpy as np
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()
device = 'cuda:0'

for it in range(100000):
    # Randomize triangle count, viewport resolution, and texture resolution every iteration.
    num_tri = np.random.randint(1, 2**16)
    res_x = np.random.randint(32, 1024)
    res_y = np.random.randint(32, 1024)
    tex_x = np.random.randint(32, 1024)
    tex_y = np.random.randint(32, 1024)

    # Random geometry, UVs, and texture; pad positions to homogeneous clip coordinates (w=1).
    pos = torch.randn([1, num_tri*3, 3], dtype=torch.float32, device=device, requires_grad=True)
    pos = torch.nn.functional.pad(pos, (0, 1), value=1.0)
    uv = torch.rand(size=[1, num_tri*3, 2], dtype=torch.float32, device=device, requires_grad=True)
    tex = torch.rand(size=[1, tex_y, tex_x, 3], dtype=torch.float32, device=device, requires_grad=True)
    tri = torch.arange(0, num_tri*3, dtype=torch.int32, device=device).reshape([num_tri, 3])

    # Forward: rasterize, interpolate UVs, sample texture. Backward: propagate a scalar loss.
    rast_out, _ = dr.rasterize(glctx, pos, tri, resolution=[res_y, res_x])
    uv_out, _ = dr.interpolate(uv, rast_out, tri)
    tex_out = dr.texture(tex, uv_out, filter_mode='linear')
    loss = torch.mean(tex_out)
    loss.backward()

    if not it % 1000:
        print('iter: %d' % it)
Thanks for your code. I have a question: why does the first run take so long? I'm running the code in the Docker container you provide, and I've only implemented a function similar to warpaffine. The running times are as follows.
Can I avoid this cold-start cost?
Looking forward to your reply.