NotACracker / COTR

[CVPR24] COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
Apache License 2.0

AttributeError: 'ModelEMA' object has no attribute 'eval' #7

Open xinpuliu opened 3 weeks ago

xinpuliu commented 3 weeks ago

Hi, during training, right after epoch 4 when the intermediate evaluation runs, I got the following error. What is causing it?

2024-06-26 07:44:13,280 - mmdet - INFO - Epoch [4][7000/7033] lr: 1.000e-04, eta: 2 days, 1:50:24, time: 1.272, data_time: 0.013, memory: 15817, loss_occ: 0.2565, loss_cls: 0.0794, loss_mask: 0.3734, loss_dice: 0.7583, loss_depth: 0.1387, loss: 1.6063, grad_norm: 2.7727
2024-06-26 07:44:55,410 - mmdet - INFO - Saving checkpoint at 4 epochs
Traceback (most recent call last):
  File "./tools/train_occ.py", line 263, in <module>
    main()
  File "./tools/train_occ.py", line 252, in main
    train_occ_model(
  File "/home/lthpc/phd/cotr/mmdet3d/apis/train_occ.py", line 350, in train_occ_model
    train_detector(
  File "/home/lthpc/phd/cotr/mmdet3d/apis/train_occ.py", line 325, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 56, in train
    self.call_hook('after_train_epoch')
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/lthpc/phd/cotr/mmdet3d/core/hook/eval_hooks.py", line 78, in _do_evaluate
    results = multi_gpu_test(
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/mmdet/apis/test.py", line 100, in multi_gpu_test
    model.eval()
AttributeError: 'ModelEMA' object has no attribute 'eval'

terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1a8234cd62 in /home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f1adf60c9ba in /home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f1adf60ecb0 in /home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11c (0x7f1adf60f77c in /home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: + 0xd6df4 (0x7f1b4a3bedf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f1b4ff72609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f1b4fd3d353 in /lib/x86_64-linux-gnu/libc.so.6)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 279812 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 279813) of binary: /home/lthpc/anaconda3/envs/cotr/bin/python
Traceback (most recent call last):
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lthpc/anaconda3/envs/cotr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train_occ.py FAILED

Failures:
[1]:
  time       : 2024-06-26_07:44:59
  host       : lthpc-Super-Server
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 279814)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

NotACracker commented 2 weeks ago


Sorry, this is caused by a bug in eval_hooks.py. You can fix it by changing lines 62 and 79 of mmdet3d/core/hook/eval_hooks.py to use runner.ema_model.ema_model.
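
For context, here is a minimal, self-contained sketch of why the AttributeError appears and what the suggested change amounts to. The ModelEMA class below is a hypothetical stand-in, not the repo's actual implementation; the only assumption taken from the fix above is that the wrapper stores the averaged network in an ema_model attribute.

```python
import copy
import torch.nn as nn

# Hypothetical, simplified stand-in for an EMA wrapper: it holds an averaged
# copy of the model in .ema_model, but the wrapper itself is a plain object,
# not an nn.Module, so it has no .eval() method of its own.
class ModelEMA:
    def __init__(self, model: nn.Module):
        self.ema_model = copy.deepcopy(model)  # the actual averaged nn.Module

model = nn.Linear(4, 2)
wrapper = ModelEMA(model)

# wrapper.eval()          # AttributeError: 'ModelEMA' object has no attribute 'eval'
wrapper.ema_model.eval()  # works: unwrap first, which is what the fix does
```

The same unwrapping is what the edit in eval_hooks.py comes down to: the object handed to multi_gpu_test must be the underlying nn.Module (which has .eval()), not the EMA wrapper, hence runner.ema_model.ema_model at the lines mentioned above.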