bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!
MIT License
857 stars 84 forks

OOM during backward #19

Closed csrqli closed 1 year ago

csrqli commented 2 years ago

I tried to train on the DTU dataset using this implementation, but GPU memory is exceeded during the backward pass after a few steps.

If I delete the RGB loss, the OOM disappears. The shape of the rgb_ground_truth tensor looks correct.

Any idea why this happened? Thanks in advance!

/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB (GPU 0; 39.45 GiB total capacity; 21.52 GiB already allocated; 1.53 GiB free; 27.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
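As an aside, the error message itself suggests one generic mitigation for fragmentation: tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch (the 128 MiB value is just an example, not a recommendation from this repo):

```shell
# Cap the allocator's split size to reduce fragmentation, per the error message.
# The value (in MiB) is an example; tune it for your GPU.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then launch training as usual, e.g.:
# python launch.py --config configs/neus-dtu.yaml --gpu 0 --train
```

This only helps when reserved memory is much larger than allocated memory; it cannot fix a genuinely oversized batch.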

bennyguo commented 2 years ago

Hi! I think there are two simple ways: (1) reduce max_train_num_rays; or (2) set dynamic_ray_sampling to false and set train_num_rays to a larger value. It is unusual to train without the RGB loss, and there could be large occupied areas on the image canvas (which leads to fewer pruned rays) since the background model is not implemented when you train on DTU.
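For context, dynamic ray sampling scales the ray batch to hit a target sample count per step, so a scene with few pruned samples can drive memory up; disabling it and fixing train_num_rays caps that. A rough sketch of such an update rule (the names and exact formula are illustrative, not the repo's code):

```python
def update_num_rays(num_rays, num_samples, target_samples, max_num_rays):
    """Scale the ray batch so the next step lands near target_samples.

    If the last batch produced many samples per ray (dense, unpruned space),
    the ray count shrinks; if rays were cheap, it grows, capped at max_num_rays.
    """
    if num_samples == 0:
        return num_rays
    scaled = int(num_rays * target_samples / num_samples)
    return max(1, min(scaled, max_num_rays))

# Example: 256 rays produced 2x the target sample count -> halve the ray batch.
print(update_num_rays(256, num_samples=524288,
                      target_samples=262144, max_num_rays=8192))  # -> 128
```

With this kind of scheme, lowering max_train_num_rays bounds the worst case, while turning the scheme off pins the batch size entirely.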

csrqli commented 2 years ago

Thanks for the prompt reply! I tried setting dynamic_ray_sampling to false and train_num_rays=128, but the OOM still happened.

Yes, I did not implement the background model, but I used the foreground masks provided with the original dataset for training.

I think the following traceback indicates that the error occurred in the encoding of the geometry model.

I'm still working on tracing this, and I would appreciate any hints based on the information above.

File "/home/comp/username/instant-nsr-pl/systems/neus.py", line 85, in training_step
    out = self(batch)
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/comp/username/instant-nsr-pl/systems/neus.py", line 47, in forward
    out = self.model(batch['rays'])
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/comp/username/instant-nsr-pl/models/neus.py", line 164, in forward
    out = self.forward_(rays)
  File "/home/comp/username/instant-nsr-pl/models/neus.py", line 125, in forward_
    ray_indices, t_starts, t_ends = ray_marching(
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/nerfacc/ray_marching.py", line 202, in ray_marching
    alphas = alpha_fn(t_starts, t_ends, ray_indices.long())
  File "/home/comp/username/instant-nsr-pl/models/neus.py", line 104, in alpha_fn
    sdf, sdf_grad = self.geometry(positions, with_grad=True, with_feature=False)
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/comp/username/instant-nsr-pl/models/geometry.py", line 141, in forward
    out = self.network(self.encoding(points.view(-1, 3))).view(*points.shape[:-1], self.n_output_dims).float()
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/comp/username/instant-nsr-pl/models/network_utils.py", line 49, in forward
    return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/tinycudann-1.6-py3.10-linux-x86_64.egg/tinycudann/modules.py", line 145, in forward
    output = _module_function.apply(
  File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/tinycudann-1.6-py3.10-linux-x86_64.egg/tinycudann/modules.py", line 57, in forward
    native_ctx, output = native_tcnn_module.fwd(input, params)
RuntimeError: /home/comp/username/tcnn/cuda117/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h:584 cuMemCreate(&m_handles.back(), n_bytes_to_allocate, &prop, 0) failed with error CUDA_ERROR_OUT_OF_MEMORY
Epoch 0: : 32it [00:24,  1.31it/s, loss=3.37, train/inv_s=20.20, train/num_rays=128.0]
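For what it's worth, the traceback shows the OOM being raised inside the tiny-cuda-nn hash encoding while alpha_fn queries the SDF for every marched sample, so the peak is driven by the number of sample positions fed to the encoder in one call, not by the ray count alone. A generic way to bound that peak (a hedged sketch, not the repo's code; query_in_chunks and chunk_size are illustrative, and NumPy stands in for torch tensors) is to evaluate positions in fixed-size chunks:

```python
import numpy as np

def query_in_chunks(query_fn, positions, chunk_size=65536):
    """Evaluate query_fn (e.g. an SDF network) over positions in fixed-size
    chunks so peak activation memory scales with chunk_size, not batch size."""
    outs = [query_fn(positions[i:i + chunk_size])
            for i in range(0, len(positions), chunk_size)]
    return np.concatenate(outs, axis=0)

# Toy example with a unit-sphere SDF standing in for the network:
pts = np.random.randn(200000, 3)
sdf = query_in_chunks(lambda p: np.linalg.norm(p, axis=-1, keepdims=True) - 1.0,
                      pts, chunk_size=4096)
print(sdf.shape)  # (200000, 1)
```

Note that chunking only reduces forward-pass peaks; samples evaluated with gradients enabled still retain their graphs until backward.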
bennyguo commented 2 years ago

Could you share the training data so that I can debug?

csrqli commented 2 years ago

> Could you share the training data so that I can debug?

Thanks a lot! I've emailed you :)

bennyguo commented 1 year ago

Hi, could you try our latest code and see if the problem is gone?

csrqli commented 1 year ago

The problem is gone. Many thanks!