Lilac-Lee / PointNetLK_Revisited

Implementation for our CVPR 2021 oral paper "PointNetLK Revisited".
MIT License

Running test.py needs more GPU memory than train.py #12

Open rajat-talak opened 1 year ago

rajat-talak commented 1 year ago

Hi @Lilac-Lee,

When I try running test.py on my computer, it gives me a RuntimeError: CUDA out of memory. This, however, does not happen with train.py. Why does the code require more GPU memory in testing than in training? Training uses a higher batch size than testing, so I would expect the opposite. Is this a bug?

PS - I changed the dataset from 3DMatch to ModelNet, and the issue remains. test.py for 3DMatch attempts to allocate 3.91 GB, while test.py for ModelNet attempts to allocate 22.01 GB.

Thank you,

Lilac-Lee commented 1 year ago

Hi @rajat-talak, thanks for reaching out.

What number of points did you use for the ModelNet dataset? Also, when computing the Jacobian we only used 100 points; could you double-check the Jacobian size? To use a Jacobian with a larger number of points, you might need to aggregate the Jacobian computation.

Let me know if this problem persists. Cheers.
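The aggregation idea could be sketched as follows. This is a hypothetical illustration, not the repo's actual API: jac_fn, the chunk size, and the 6-DoF Jacobian width are all assumptions. The point is that accumulating J^T J chunk by chunk avoids materializing the full Jacobian at once.

```python
import numpy as np

def aggregated_jtj(points, jac_fn, chunk_size=100):
    """Accumulate J^T J over point chunks instead of building J in one shot.

    points: (N, 3) array; jac_fn maps an (n, 3) chunk to its (m, 6)
    block of Jacobian rows (jac_fn is a stand-in for the real computation).
    """
    jtj = np.zeros((6, 6))
    for start in range(0, len(points), chunk_size):
        j_chunk = jac_fn(points[start:start + chunk_size])
        jtj += j_chunk.T @ j_chunk  # (6, 6) running sum
    return jtj

# Toy check: for a linear jac_fn, chunked and full J^T J agree.
rng = np.random.default_rng(0)
pts = rng.standard_normal((927, 3))
W = rng.standard_normal((3, 6))
full = (pts @ W).T @ (pts @ W)
assert np.allclose(aggregated_jtj(pts, lambda p: p @ W), full)
```

Peak memory then depends on the chunk size rather than the total number of points.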

CVrookieee commented 1 year ago

Hi @Lilac-Lee ,

I encountered the same error. I used 1000 points for the ModelNet dataset. Details are as follows.

Traceback (most recent call last):
  File "test.py", line 157, in <module>
    main(ARGS)
  File "test.py", line 112, in main
    test(args, testset, dptnetlk)
  File "test.py", line 106, in test
    dptnetlk.test_one_epoch(model, testloader, args.device, 'test', args.data_type, args.vis)
  File "/home/***/PytorchProject/***/dptlk_o/trainer.py", line 149, in test_one_epoch
    p1, None, j, self.xtol, self.p0_zero_mean, self.p1_zero_mean, mode, data_type)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 189, in do_forward
    r = net(q0, q1, mode, maxiter=maxiter, xtol=xtol, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
  File "/home/***/Software/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 210, in forward
    r, g, itr = self.iclk_new(g0, p0, p1, maxiter, xtol, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 297, in iclk_new
    num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type)   # B x N x K x D, K=1024, D=3 or 6
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 231, in Cal_Jac
    Mask_fn, A_fn, Ax_fn, BN_fn, self.device).to(self.device)
  File "/home/***/PytorchProject/***/dptlk_o/utils.py", line 316, in feature_jac
    A3BN3M3 =  M3 * dBN3 * A3
RuntimeError: CUDA out of memory. Tried to allocate 22.01 GiB (GPU 1; 23.69 GiB total capacity; 2.61 GiB already allocated; 18.16 GiB free; 3.69 GiB reserved in total by PyTorch)

I located the problem in model.py line 294-297.

if mode == 'test':
    f0, Mask_fn, A_fn, Ax_fn, BN_fn, max_idx = self.ptnet(p0, -1)
    J = self.Cal_Jac(Mask_fn, A_fn, Ax_fn, BN_fn, max_idx,
                     num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type)   # B x N x K x D, K=1024, D=3 or 6

While debugging, I found that num_points is 927. I think this may be the case you mentioned about computing a Jacobian with a larger number of points, but this function is too complicated for me to modify. I hope my comments are helpful.

Thank you,

Lilac-Lee commented 1 year ago

Hi, the previous discussion of a similar issue in https://github.com/Lilac-Lee/PointNetLK_Revisited/issues/10 might be helpful. Could you take a look? Cheers.

CVrookieee commented 1 year ago

Hi @Lilac-Lee,

Thank you very much for your quick reply. I have looked at issue #10 and it is enlightening. My problem has been solved, though not by the method in issue #10. I found that data_utils.Resampler() is applied in the train and val modes but not in test. At test time a single point cloud can contain more than 45,000 points, and computing the Jacobian on it runs out of memory.
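A rough sanity check on the numbers: assuming a hypothetical N x K x C float32 intermediate (K=1024 features and an illustrative channel dimension C=128; the exact tensor shapes in utils.py differ), the allocation grows linearly with the number of points.

```python
# Back-of-envelope estimate of a per-point Jacobian intermediate.
# feat_dim, channels, and itemsize are illustrative assumptions,
# not the actual shapes used in utils.py.

def intermediate_gib(num_points, feat_dim=1024, channels=128, itemsize=4):
    """Rough float32 size in GiB of an N x K x C intermediate tensor."""
    return num_points * feat_dim * channels * itemsize / 2**30

print(f"{intermediate_gib(45000):.1f} GiB without resampling")  # ~22 GiB
print(f"{intermediate_gib(1000):.2f} GiB with Resampler(1000)")  # ~0.49 GiB
```

Under these assumed shapes, 45,000 points land in the same ballpark as the 22.01 GiB allocation in the traceback, while resampling to 1,000 points brings it well under a gigabyte.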

I modified lines 123-126 in test.py as follows (the last line is added).

if args.dataset_type == 'modelnet':
    transform = torchvision.transforms.Compose([
        data_utils.Mesh2Points(),
        data_utils.OnUnitCube(),
        data_utils.Resampler(args.num_points)])

Thank you again.