fxia22 / pointGAN

point set generative adversarial nets

nndistance may cause CUDA memory leak #6

Open WangZixuan opened 6 years ago

WangZixuan commented 6 years ago

I used your nndistance in my work but ran into a memory leak, so I tested the code below.

import torch
from torch.autograd import Variable

# NNDModule is the wrapper module from this repo's nndistance package;
# import it as in the repo (adjust the path to your checkout).


class Test():
    def __init__(self):
        super(Test, self).__init__()
        self.dist = NNDModule()

    def run(self):
        # Repeatedly compute the NN distance and backpropagate; memory use should stay flat.
        for i in range(0, 20000):
            p1 = torch.rand(1, 1024, 3) * 20
            p2 = torch.rand(1, 2048, 3) * 10
            p2 = p2.float()

            print('gpu')
            points1_cuda = Variable(p1.cuda(), requires_grad=True)
            points2_cuda = Variable(p2.cuda())
            dist1_cuda, dist2_cuda = self.dist(points1_cuda, points2_cuda)
            print(dist1_cuda, dist2_cuda)
            loss_cuda = torch.sum(dist1_cuda)
            print(loss_cuda)
            loss_cuda.backward()
            print(points1_cuda.grad, points2_cuda.grad)


if __name__ == '__main__':
    t = Test()
    t.run()

As the loop runs, the CUDA memory in use keeps increasing.
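For reference, one way to quantify the growth is to print the allocator statistics every few iterations. A minimal sketch, assuming a PyTorch version that exposes torch.cuda.memory_allocated() and torch.cuda.memory_cached():

import torch

def report_cuda_memory(step):
    # Memory currently held by live tensors vs. cached by the allocator.
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    cached_mb = torch.cuda.memory_cached() / 1024 ** 2  # renamed memory_reserved() in newer PyTorch
    print('step %d: %.1f MiB allocated, %.1f MiB cached' % (step, allocated_mb, cached_mb))

If the allocated number keeps climbing even though no tensors from earlier iterations are still referenced, something (here, the autograd Function) is holding on to them.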

fxia22 commented 6 years ago

Thanks for the feedback, I will double-check.

fxia22 commented 6 years ago

Can you let me know your software stack version? @WangZixuan

WangZixuan commented 6 years ago

I tested on Ubuntu 16.04 with an Nvidia 1080 Ti, CUDA compilation tools release V8.0.61, CUDA version 9.0.176.

fxia22 commented 6 years ago

Thanks, that's helpful. What about the PyTorch version? @WangZixuan

WangZixuan commented 6 years ago

PyTorch 0.4.0, torchvision 0.2.0

pclucas14 commented 6 years ago

Hi, did you find a fix for this?

WangZixuan commented 6 years ago

@pclucas14 No, but I wrote my own Chamfer distance in PyTorch; please check https://gist.github.com/WangZixuan/4c4cdf49ce9989175e94524afc946726. You can try it, and should you have any problems, feel free to contact me.
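For anyone who just wants a quick reference while debugging, a dense pure-PyTorch Chamfer distance looks roughly like the following. This is an illustrative sketch, not the code in the gist above; it materializes the full pairwise distance matrix, trading memory for simplicity:

import torch

def chamfer_distance(p1, p2):
    # p1: (B, N, 3), p2: (B, M, 3). Returns per-point squared nearest-neighbour
    # distances in both directions; builds a full (B, N, M) matrix.
    diff = p1.unsqueeze(2) - p2.unsqueeze(1)   # (B, N, M, 3)
    dist = (diff ** 2).sum(dim=3)              # (B, N, M) squared distances
    dist1, _ = dist.min(dim=2)                 # nearest point in p2 for each point in p1
    dist2, _ = dist.min(dim=1)                 # nearest point in p1 for each point in p2
    return dist1, dist2

A loss such as dist1.mean() + dist2.mean() is then differentiable end-to-end with no custom kernel.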

pclucas14 commented 6 years ago

Thanks for sharing the code. I'm working with very large point clouds, so an O(N)-memory implementation is crucial.
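One common workaround for that constraint is to chunk the nearest-neighbour search so that only a slice of the distance matrix exists at any time. It is not O(N) like a dedicated CUDA kernel, but it caps the peak. A rough sketch (function name and chunk size are illustrative):

import torch

def nn_dist_chunked(p1, p2, chunk=512):
    # For each point in p1 (B, N, 3), squared distance to its nearest neighbour
    # in p2 (B, M, 3). Peak memory is O(B * chunk * M) instead of O(B * N * M).
    mins = []
    for start in range(0, p1.size(1), chunk):
        block = p1[:, start:start + chunk]               # (B, c, 3)
        diff = block.unsqueeze(2) - p2.unsqueeze(1)      # (B, c, M, 3)
        mins.append((diff ** 2).sum(dim=3).min(dim=2)[0])
    return torch.cat(mins, dim=1)                        # (B, N)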

tejaskhot commented 6 years ago

Hi! Did any of you get EMD to work without a memory leak? @fxia22 any luck with the fix? Thanks!

pclucas14 commented 6 years ago

I rewrote the wrapper for EMD, which does not give a memory leak; however, I have not tested it. You can find it here

pxiangwu commented 6 years ago

Hi @WangZixuan @pclucas14 @tejaskhot

The memory leak is caused by incorrect use of the PyTorch 0.4 API.

In the file functions/nnd.py, we should no longer store the intermediate values on "self", but pass them through "ctx" (via save_for_backward) instead. It can be fixed as follows:

import torch
from torch.autograd import Function

# my_lib is the compiled C/CUDA extension that ships with the repo's nndistance
# package; keep the original import from functions/nnd.py.


class NNDFunction(Function):
    def forward(ctx, xyz1, xyz2):
        batchsize, n, _ = xyz1.size()
        _, m, _ = xyz2.size()

        dist1 = torch.zeros(batchsize, n)
        dist2 = torch.zeros(batchsize, m)

        idx1 = torch.zeros(batchsize, n).type(torch.IntTensor)
        idx2 = torch.zeros(batchsize, m).type(torch.IntTensor)

        if not xyz1.is_cuda:
            my_lib.nnd_forward(xyz1, xyz2, dist1, dist2, idx1, idx2)
        else:
            dist1 = dist1.cuda()
            dist2 = dist2.cuda()
            idx1 = idx1.cuda()
            idx2 = idx2.cuda()
            my_lib.nnd_forward_cuda(xyz1, xyz2, dist1, dist2, idx1, idx2)

        # Save everything backward needs on ctx instead of self, so no stray
        # references keep the tensors alive after backward has run.
        ctx.save_for_backward(xyz1, xyz2, idx1, idx2, dist1, dist2)

        return dist1, dist2

    def backward(ctx, graddist1, graddist2):
        xyz1, xyz2, idx1, idx2, dist1, dist2 = ctx.saved_tensors

        graddist1 = graddist1.contiguous()
        graddist2 = graddist2.contiguous()

        gradxyz1 = torch.zeros(xyz1.size())
        gradxyz2 = torch.zeros(xyz2.size())

        if not graddist1.is_cuda:
            my_lib.nnd_backward(xyz1, xyz2, gradxyz1, gradxyz2, graddist1, graddist2, idx1, idx2)
        else:
            gradxyz1 = gradxyz1.cuda()
            gradxyz2 = gradxyz2.cuda()
            my_lib.nnd_backward_cuda(xyz1, xyz2, gradxyz1, gradxyz2, graddist1, graddist2, idx1, idx2)

        return gradxyz1, gradxyz2

I have tested this modified code with the testing code provided by @WangZixuan, and it no longer leaks memory. Also, I think dist1 and dist2 are unused in backward and no longer need to be saved (although I did not remove them in the modified code).
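For completeness, the trimmed version mentioned above would save only what backward actually reads. A sketch, assuming nothing else consumes dist1/dist2 from the context:

# In forward(), save only the tensors backward() uses:
ctx.save_for_backward(xyz1, xyz2, idx1, idx2)

# In backward(), unpack accordingly:
xyz1, xyz2, idx1, idx2 = ctx.saved_tensors

Note also that on newer PyTorch releases (1.0 and later), forward and backward have to be declared as @staticmethod and the function is invoked through NNDFunction.apply(...) rather than by instantiating it.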

YuanxunLu commented 5 years ago

How can I use nndistance on Windows instead of Linux?

sohee-zoe commented 5 years ago

Thank you so much