city-super / GSDF

[NeurIPS 2024] GSDF: 3DGS Meets SDF for Improved Rendering and Reconstruction

RuntimeError: CUDA error: device-side assert triggered #12

Open zetal-tip opened 3 weeks ago

zetal-tip commented 3 weeks ago

Hi, thanks for your great work. I get the following error when running the truck scene:

```
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
.......
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [125,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1659484801627/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 379, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 235, in optimizer_step
    optimizer, model=model, optimizer_idx=opt_idx, closure=closure, **kwargs
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/optim/adamw.py", line 119, in step
    loss = closure()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 105, in _wrap_closure
    closure_result = closure()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
    return self.model(*args, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/GSDF/instant_nsr/systems/neus.py", line 441, in training_step
    out = self(batch, picked_gs_depth_dt, use_depth_guide=False)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/GSDF/instant_nsr/systems/neus.py", line 236, in forward
    return self.model(batch['rays'], gs_depth, use_depth_guide)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/GSDF/instant_nsr/models/neus.py", line 440, in forward
    out = self.forward_(rays, gs_depth, use_depth_guide)
  File "/home/GSDF/instant_nsr/models/neus.py", line 316, in forward_
    ray_indices, midpoints, positions, dists, intersected_ray_indices = self.ray_upsampe_hier(rays_o=rays_o, rays_d=rays_d, gs_depth=gs_depth, use_depth_guide=use_depth_guide)
  File "/home/GSDF/instant_nsr/models/neus.py", line 181, in ray_upsampe_hier
    intersected_ray_indices = ((t_max > 0) & (t_max < 1e9) & (gs_depth_probe.squeeze(dim=-1) < t_max) & (gs_depth_probe.squeeze(dim=-1) > t_min)).nonzero(as_tuple=False).view(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "launch.py", line 181, in <module>
    main()
  File "launch.py", line 170, in main
    trainer.fit(system, datamodule=dm)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 609, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1175, in _teardown
    self.strategy.teardown()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 490, in teardown
    super().teardown()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/parallel.py", line 128, in teardown
    super().teardown()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 496, in teardown
    self.lightning_module.cpu()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 78, in cpu
    return super().cpu()
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 738, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 738, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:  33%|████████████████████████████████████████████████████████████████▍ | 15010/45000 [10:54<21:48, 22.93it/s, Loss=0.0385715]
Epoch 0: : 0it [00:03, ?it/s]
Exception ignored in: <function tqdm.__del__ at 0x7f068a407a70>
Traceback (most recent call last):
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1148, in __del__
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1303, in close
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/std.py", line 1287, in fp_write
  File "/home/miniconda3/envs/gsdf/lib/python3.7/site-packages/tqdm/utils.py", line 196, in inner
  File "/home/GSDF/gaussian_splatting/utils/general_utils.py", line 144, in write
ImportError: sys.meta_path is None, Python is likely shutting down
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1659484801627/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f069b63c497 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d4a3 (0x7f06c8c4d4a3 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f06c8c53437 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46e578 (0x7f06db478578 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f069b61fd95 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35fb45 (0x7f06db369b45 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6b03e0 (0x7f06db6ba3e0 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7f06db6ba7e8 in /home/miniconda3/envs/gsdf/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #25: __libc_start_main + 0xe7 (0x7f071d810c87 in /lib/x86_64-linux-gnu/libc.so.6)

./train.sh: line 12: 19590 Aborted (core dumped) python launch.py --exp_dir ${exp_dir} --config ${config} --gpu ${gpu} --train --eval tag=${tag}
```

How should I fix it?
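Because the assert is reported asynchronously, the line blamed above (`ray_upsampe_hier` in `instant_nsr/models/neus.py`) is not necessarily where the bad value originates; the log itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`. Below is only a debugging sketch, assuming the tensor names visible in the traceback (`gs_depth_probe`, `t_min`, `t_max`); `check_depth_probe` is a hypothetical helper, not part of GSDF.

```python
# Debugging sketch, not part of GSDF: rerun with synchronous kernel launches so the
# device-side assert points at the real call site, e.g.
#   CUDA_LAUNCH_BLOCKING=1 ./train.sh
import torch


def check_depth_probe(gs_depth_probe: torch.Tensor,
                      t_min: torch.Tensor,
                      t_max: torch.Tensor) -> None:
    """Temporary guard (hypothetical helper) to drop in just before the failing
    .nonzero() line in ray_upsampe_hier: fail with a readable message if the
    Gaussian depth used for guidance already contains NaNs."""
    probe = gs_depth_probe.squeeze(dim=-1)
    if torch.isnan(probe).any():
        raise RuntimeError(f"gs_depth_probe contains {int(torch.isnan(probe).sum())} NaN entries")
    if torch.isnan(t_min).any() or torch.isnan(t_max).any():
        raise RuntimeError("ray bounds t_min/t_max contain NaNs")
```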
df34 commented 3 weeks ago

Yes, I also hit this error when training reaches 15,010 iterations.

df34 commented 3 weeks ago

In fact, even when I set self.use_tcnn = False, the loss still becomes NaN at 15,010 iterations. May I ask what to do in this case?
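One way to localize the NaN instead of waiting for the crash at iteration 15,010 is to enable anomaly detection. This is only a sketch, assuming the `pl.Trainer` is the one constructed in `launch.py`; the flags are standard PyTorch / PyTorch Lightning 1.x options, nothing GSDF-specific, and `max_steps=45000` is shown only to mirror the 45,000-step run from the log.

```python
# Sketch: surface the first op that produces NaN/Inf instead of crashing later.
import torch
import pytorch_lightning as pl

# Noticeably slower, so enable only while hunting the NaN.
torch.autograd.set_detect_anomaly(True)

# Lightning >= 1.5 exposes the same switch on the Trainer; pass it wherever
# launch.py builds its Trainer.
trainer = pl.Trainer(detect_anomaly=True, max_steps=45000)
```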

ThePassedWind commented 2 weeks ago

Same here, stuck at 15,010 iterations. [screenshot]

df34 commented 2 weeks ago

Hello, the loss does become NaN when I run on Windows, but when I set up the environment with Docker, training continues without issues, even though the package versions are the same. I hope you find this helpful.

zetal-tip commented 2 weeks ago

> Hello, the loss does become NaN when I run on Windows, but when I set up the environment with Docker, training continues without issues, even though the package versions are the same. I hope you find this helpful.

Hey, thanks for the info! Actually, I'm running into this NaN loss issue on Linux, not Windows. Since you've got it working in Docker, would you mind sharing your setup?

ThePassedWind commented 2 weeks ago

> Same here, stuck at 15,010 iterations. [screenshot]

I changed my pytorch_lightning version to 1.9.5, and it worked.
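For anyone trying the same fix, a quick sanity check that the downgrade actually took effect in the conda environment might look like this; the package names are the standard ones, nothing GSDF-specific.

```python
# Verify the installed versions, e.g. after `pip install pytorch-lightning==1.9.5`.
import pytorch_lightning as pl
import torch

print("pytorch_lightning:", pl.__version__)  # expect 1.9.5 if the downgrade worked
print("torch:", torch.__version__)
```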

ThePassedWind commented 2 weeks ago

> Hello, the loss does become NaN when I run on Windows, but when I set up the environment with Docker, training continues without issues, even though the package versions are the same. I hope you find this helpful.
>
> Hey, thanks for the info! Actually, I'm running into this NaN loss issue on Linux, not Windows. Since you've got it working in Docker, would you mind sharing your setup?

I haven't faced this problem when training on the DTU dataset.

df34 commented 2 weeks ago

> Hello, the loss does become NaN when I run on Windows, but when I set up the environment with Docker, training continues without issues, even though the package versions are the same. I hope you find this helpful.
>
> Hey, thanks for the info! Actually, I'm running into this NaN loss issue on Linux, not Windows. Since you've got it working in Docker, would you mind sharing your setup?

Do you want me to give you the Docker images?

zetal-tip commented 2 weeks ago

> I haven't faced this problem when training on the DTU dataset.

Thanks for the tip! I’ve been able to train on the DTU dataset without any issues, but I’m still getting errors in the truck scene.