Open · zetal-tip opened this issue 3 weeks ago
Yes, I also hit an error once training reaches 15,010 iterations.
In fact, even when I set self.use_tcnn = False, the loss still becomes NaN at 15,010 iterations. What should I do in this case?
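For anyone trying to narrow this down, a generic way to catch where the NaN first shows up is to enable autograd anomaly detection and to guard the loss before calling backward. The snippet below is only a sketch against plain PyTorch, not the repo's actual training loop; loss and global_step are placeholder names.

```python
import torch

# Debug-only: make the backward pass raise at the forward op that produced
# a NaN/Inf instead of silently propagating it (this slows training down).
torch.autograd.set_detect_anomaly(True)

def loss_is_usable(loss: torch.Tensor, global_step: int) -> bool:
    """Return False (i.e. skip the optimizer step) when the loss is non-finite."""
    if not torch.isfinite(loss):
        print(f"Non-finite loss at step {global_step}: {loss.item()}")
        return False
    return True
```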
Same here, stuck at 15,010 iterations.
Hello, the loss does become NaN when I train on Windows, but when I set up the environment with Docker, training continues normally, even though the relevant package versions are the same. I hope this helps.
Hey, thanks for the info! Actually, I'm running into this NaN loss issue on Linux, not Windows. Since you've got it working in Docker, would you mind sharing your setup?
I changed my pytorch_lightning version to 1.9.5, and it worked.
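If you try this, it is worth double-checking that the downgrade actually took effect in the environment you launch training from:

```python
# Confirm the interpreter resolves the downgraded package
# (1.9.5 is the version reported to work in this thread).
import pytorch_lightning as pl
print(pl.__version__)
```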
I haven't faced this problem when training on the DTU dataset.
Do you want me to send you the Docker images?
Thanks for the tip! I’ve been able to train on the DTU dataset without any issues, but I’m still getting errors in the truck scene.
Hi, thanks for your great work. When running the truck scene, I get an error like this:
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
.......
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [125,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 379, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 235, in optimizer_step
optimizer, model=model, optimizer_idx=opt_idx, closure=closure, **kwargs
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/adamw.py", line 119, in step
loss = closure()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 105, in _wrap_closure
closure_result = closure()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
step_output = self._step_fn()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
return self.model(*args, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
output = self._forward_module.training_step(*inputs, **kwargs)
File "/home/GSDF/instant_nsr/systems/neus.py", line 441, in training_step
out = self(batch, picked_gs_depth_dt, use_depth_guide=False)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/GSDF/instant_nsr/systems/neus.py", line 236, in forward
return self.model(batch['rays'], gs_depth, use_depth_guide)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/GSDF/instant_nsr/models/neus.py", line 440, in forward
out = self.forward_(rays, gs_depth, use_depth_guide)
File "/home/GSDF/instant_nsr/models/neus.py", line 316, in forward_
ray_indices, midpoints, positions, dists, intersected_ray_indices = self.ray_upsampe_hier(rays_o=rays_o, rays_d=rays_d, gs_depth=gs_depth, use_depth_guide=use_depth_guide)
File "/home/GSDF/instant_nsr/models/neus.py", line 181, in ray_upsampe_hier
intersected_ray_indices = ((t_max > 0) & (t_max < 1e9) & (gs_depth_probe.squeeze(dim=-1) < t_max) & (gs_depth_probe.squeeze(dim=-1) > t_min)).nonzero(as_tuple=False).view(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "launch.py", line 181, in <module>
main()
File "launch.py", line 170, in main
trainer.fit(system, datamodule=dm)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 609, in fit
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
trainer._teardown()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1175, in _teardown
self.strategy.teardown()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 490, in teardown
super().teardown()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/parallel.py", line 128, in teardown
super().teardown()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 496, in teardown
self.lightning_module.cpu()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 78, in cpu
return super().cpu()
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 738, in cpu
return self._apply(lambda t: t.cpu())
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 738, in
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 33%|████████████████████████████████████████████████████████████████▍ | 15010/45000 [10:54<21:48, 22.93it/s, Loss=0.0385715]
Epoch 0: : 0it [00:03, ?it/s]
Exception ignored in: <function tqdm.__del__ at 0x7f068a407a70>
Traceback (most recent call last):
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1148, in del
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1303, in close
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1287, in fp_write
File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/utils.py", line 196, in inner
File "/home/GSDF/gaussian_splatting/utils/general_utils.py", line 144, in write
ImportError: sys.meta_path is None, Python is likely shutting down
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1659484801627/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f069b63c497 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d4a3 (0x7f06c8c4d4a3 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f06c8c53437 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46e578 (0x7f06db478578 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f069b61fd95 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35fb45 (0x7f06db369b45 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6b03e0 (0x7f06db6ba3e0 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7f06db6ba7e8 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
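A note for triage: the device-side assert is an index-out-of-bounds raised while handling gs_depth inside ray_upsampe_hier, so rerunning with CUDA_LAUNCH_BLOCKING=1 (as the error message suggests) and then sanity-checking the tensor being indexed should pin down the bad index. The helper below is only a sketch assembled from names visible in the traceback, not the repository's actual code.

```python
# Sketch only: gs_depth / ray_indices are guesses taken from the traceback.
# Launch the run itself with CUDA_LAUNCH_BLOCKING=1 so the failing kernel is
# reported synchronously at the real call site.
import torch

def check_index_bounds(gs_depth: torch.Tensor, ray_indices: torch.Tensor) -> None:
    """Print the facts that would explain an 'index out of bounds' assert."""
    n = gs_depth.shape[0]
    bad = (ray_indices < 0) | (ray_indices >= n)
    print("gs_depth rows:", n)
    print("ray_indices range:", int(ray_indices.min()), "to", int(ray_indices.max()))
    print("out-of-range indices:", int(bad.sum()))
    print("all gs_depth finite:", bool(torch.isfinite(gs_depth).all()))
```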