bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!

CUDA error: invalid configuration argument #73

Open zsy950116 opened 1 year ago

zsy950116 commented 1 year ago

Thank you for your excellent work. I was able to train normally and get satisfactory results, but the following error occurs during validation:

Global seed set to 42
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
/home/zsy/instant-nsr-pl-main/utils/callbacks.py:76: UserWarning: Code snapshot is not saved. Please make sure you have git installed and are in a git repository.
  rank_zero_warn("Code snapshot is not saved. Please make sure you have git installed and are in a git repository.")

  | Name  | Type      | Params
------------------------------------
0 | model | NeRFModel | 12.6 M
------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
25.220    Total estimated model params size (MB)
Epoch 0: : 600it [00:28, 21.05it/s, loss=0.00194, train/num_rays=8192.0]/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:539: PossibleUserWarning: It is recommended to use `self.log('val/psnr', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
Epoch 0: : 3517it [02:17, 25.67it/s, loss=0.000735, train/num_rays=8192.0, val/psnr=31.60]Traceback (most recent call last):
  File "launch.py", line 128, in <module>██████                                                                                   | 17/100 [00:02<00:13,  6.36it/s]
    main()
  File "launch.py", line 117, in main
    trainer.fit(system, datamodule=dm)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 359, in validation_step
    return self.model(*args, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 110, in forward
    return self._forward_module.validation_step(*inputs, **kwargs)
  File "/home/zsy/instant-nsr-pl-main/systems/nerf.py", line 137, in validation_step
    out = self(batch)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/instant-nsr-pl-main/systems/nerf.py", line 31, in forward
    return self.model(batch['rays'])
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/instant-nsr-pl-main/models/nerf.py", line 133, in forward
    out = chunk_batch(self.forward_, self.config.ray_chunk, True, rays)
  File "/home/zsy/instant-nsr-pl-main/models/utils.py", line 22, in chunk_batch
    out_chunk = func(*[arg[i:i+chunk_size] if isinstance(arg, torch.Tensor) else arg for arg in args], **kwargs)
  File "/home/zsy/instant-nsr-pl-main/models/nerf.py", line 103, in forward_
    rgb = self.texture(feature, t_dirs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/instant-nsr-pl-main/models/texture.py", line 27, in forward
    color = self.network(network_inp).view(*features.shape[:-1], self.n_output_dims).float()
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zsy/anaconda3/envs/djx_pytorch/lib/python3.8/site-packages/tinycudann-1.7-py3.8-linux-x86_64.egg/tinycudann/modules.py", line 180, in forward
    self.params.to(_torch_precision(self.native_tcnn_module.param_precision())).contiguous(),
RuntimeError: CUDA error: invalid configuration argument
Epoch 0: : 3517it [02:17, 25.59it/s, loss=0.000735, train/num_rays=8192.0, val/psnr=31.60]

My GPU is a single RTX 3090; GPU memory usage is normal in both the training and testing phases.


bennyguo commented 1 year ago

I haven't encountered this problem, but weird things like this do happen sometimes 😂 I suggest resuming from the last checkpoint (ckpts/last.ckpt) and continuing training; the problem may be gone.
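
For reference, a minimal sketch of what resuming looks like at the PyTorch Lightning level. `system` and `dm` are placeholders for the LightningModule and DataModule that launch.py builds from the YAML config, and the checkpoint path is just the one mentioned above; check launch.py for the project's own resume option rather than treating this as the exact entry point.

```python
import pytorch_lightning as pl

# `system` and `dm` stand in for the NeRF/NeuS system and datamodule that
# launch.py constructs from the config file (placeholders, not real imports).
trainer = pl.Trainer(accelerator="gpu", devices=1, precision=16)

# Passing ckpt_path tells Lightning to restore the model weights, optimizer
# state, and global step before training continues.
trainer.fit(system, datamodule=dm, ckpt_path="ckpts/last.ckpt")
```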

BCL123456-BAL commented 1 year ago

I also encountered this error when running NeRF; when I run NeuS it works normally.

wangyida commented 1 year ago

I'm assuming this error happens only during validation, right? It is probably caused by a bug in the nerfacc ray-marching function, which nerfacc fixed in newer versions as mentioned here. As mentioned, switching PyTorch to v1.13 is a brute-force solution; it has worked for me before at least. Otherwise we need to adapt the code to the newer nerfacc toolbox.
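
As a quick sanity check before trying either route, a small sketch that prints the versions involved (the 1.13 suggestion comes from the comment above; the exact nerfacc release containing the fix isn't stated here, so this only shows what is installed):

```python
# Print the packages involved in the suggested workarounds: PyTorch (pin to
# 1.13 as suggested above) and nerfacc (upgrade to a release that includes
# the ray-marching fix), plus the CUDA toolkit torch was built against.
import importlib.metadata as md
import torch

print("torch   :", torch.__version__, "| built for CUDA", torch.version.cuda)
print("nerfacc :", md.version("nerfacc"))
```

If the installed nerfacc predates the fix, upgrading it (or pinning torch to 1.13 as described above) is the thing to try first.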