Closed: csrqli closed this issue 1 year ago.
Hi! I think there are two simple ways:
(1) reduce max_train_num_rays;
(2) or set dynamic_ray_sampling to false and set train_num_rays to a larger value.
It is unusual to train without the RGB loss, and there could be large occupied areas on the image canvas (which leads to fewer pruned rays), since the background model is not implemented when you train on DTU.
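For context, here is a minimal sketch of how the dynamic ray sampling mentioned above typically adjusts the per-step ray count (assumed, simplified logic; the function name and details such as update smoothing are not from the repo):

```python
# Sketch of what dynamic ray sampling roughly does (assumed logic, simplified;
# this is not the verbatim code from systems/neus.py).
def update_train_num_rays(train_num_rays: int,
                          target_num_samples: int,
                          actual_num_samples: int,
                          max_train_num_rays: int,
                          dynamic_ray_sampling: bool = True) -> int:
    if not dynamic_ray_sampling:
        # Option (2): with dynamic sampling disabled, the ray count stays fixed.
        return train_num_rays
    # Grow or shrink the ray count so the total number of samples per step
    # stays near the target budget.
    proposed = int(train_num_rays * (target_num_samples / max(actual_num_samples, 1)))
    # Option (1): max_train_num_rays is the hard cap on how far the count can grow.
    return min(proposed, max_train_num_rays)
```

Either lowering the cap in option (1) or fixing the count in option (2) therefore bounds how many rays are marched per training step.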
Thanks for the prompt reply! I tried setting dynamic_ray_sampling to false and train_num_rays=128, but the OOM still happens.
Yes, I did not implement the background model, but I use the foreground masks provided in the original dataset for training.
I think the traceback below indicates that the error occurs in the geometry model's encoding.
I'm still tracing this, and I would appreciate any hints based on the information below.
File "/home/comp/username/instant-nsr-pl/systems/neus.py", line 85, in training_step
out = self(batch)
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/home/comp/username/instant-nsr-pl/systems/neus.py", line 47, in forward
out = self.model(batch['rays'])
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/home/comp/username/instant-nsr-pl/models/neus.py", line 164, in forward
out = self.forward_(rays)
File "/home/comp/username/instant-nsr-pl/models/neus.py", line 125, in forward_
ray_indices, t_starts, t_ends = ray_marching(
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/nerfacc/ray_marching.py", line 202, in ray_marching
alphas = alpha_fn(t_starts, t_ends, ray_indices.long())
File "/home/comp/username/instant-nsr-pl/models/neus.py", line 104, in alpha_fn
sdf, sdf_grad = self.geometry(positions, with_grad=True, with_feature=False)
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/home/comp/username/instant-nsr-pl/models/geometry.py", line 141, in forward
out = self.network(self.encoding(points.view(-1, 3))).view(*points.shape[:-1], self.n_output_dims).float()
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/home/comp/username/instant-nsr-pl/models/network_utils.py", line 49, in forward
return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/tinycudann-1.6-py3.10-linux-x86_64.egg/tinycudann/modules.py", line 145, in forward
output = _module_function.apply(
File "/home/comp/username/miniconda3/envs/a100/lib/python3.10/site-packages/tinycudann-1.6-py3.10-linux-x86_64.egg/tinycudann/modules.py", line 57, in forward
native_ctx, output = native_tcnn_module.fwd(input, params)
RuntimeError: /home/comp/username/tcnn/cuda117/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h:584 cuMemCreate(&m_handles.back(), n_bytes_to_allocate, &prop, 0) failed with error CUDA_ERROR_OUT_OF_MEMORY
Epoch 0: : 32it [00:24, 1.31it/s, loss=3.37, train/inv_s=20.20, train/num_rays=128.0]
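One way to narrow this down is a hypothetical standalone check (not part of the repo): run only a hash-grid encoding on progressively more sample positions and watch device-level free memory. torch.cuda.mem_get_info reports free memory for the whole device, so it also reflects tinycudann's internal allocations; the encoding_config values below are illustrative, not the repo's exact settings.

```python
import torch
import tinycudann as tcnn

# Hypothetical standalone check (not from instant-nsr-pl): measure how free GPU
# memory shrinks as the hash-grid encoding processes more sample positions.
# The encoding_config values are illustrative, not the repo's exact settings.
enc = tcnn.Encoding(
    n_input_dims=3,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 1.447,
    },
)

for n_points in (2**18, 2**20, 2**22):
    x = torch.rand(n_points, 3, device="cuda")
    with torch.no_grad():
        _ = enc(x)
    # mem_get_info reports free/total memory for the whole device, so it captures
    # allocations made by tinycudann's own allocator as well as PyTorch's.
    free, total = torch.cuda.mem_get_info()
    print(f"{n_points:>8d} points: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```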
Could you share the training data so that I can debug?
Thanks a lot! I've emailed you :)
Hi, could you try our latest code and see if the problem is gone?
The problem is gone. Many thanks!
I tried to train on the DTU dataset using this implementation, but GPU memory is exhausted during the backward pass after a few steps.
If I remove the RGB loss, the OOM disappears. The shape of the rgb_ground_truth tensor looks correct.
Any idea why this happens? Thanks in advance!
/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB (GPU 0; 39.45 GiB total capacity; 21.52 GiB already allocated; 1.53 GiB free; 27.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
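As the error text itself suggests, one thing worth trying regardless of the root cause is tuning the allocator's max_split_size_mb to reduce fragmentation. A minimal sketch (the config path is illustrative; the environment variable must be set before CUDA is initialized):

```python
# Set the allocator option before torch initializes CUDA, e.g. at the very top
# of the entry script, before `import torch`:
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

# Equivalent from the shell (config path is illustrative):
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python launch.py --config configs/neus-dtu.yaml --gpu 0 --train
```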