Errors occur during training

AndrewJSong commented 1 year ago

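Training cannot complete a single epoch, and the error occurs at a different batch each time. Question 1: the seed defaults to 0. Does it have no effect on data loading? If it took effect, the error should occur at the same point in every run. Question 2: with both the author-provided pkl and a pkl I generated myself, the error occurs within the first epoch of training. With the mini dataset, training completes all 20 epochs.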
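On Question 1, here is a minimal sketch of what a fixed seed typically does and does not control. It assumes the common pattern in which the sampler shuffles with `seed + epoch` and each DataLoader worker is seeded with `num_workers * rank + worker_id + seed`. The function names `worker_seed` and `epoch_order` are hypothetical and are not taken from Fast-BEV's code.

```python
import random


def worker_seed(num_workers: int, rank: int, worker_id: int, seed: int) -> int:
    """Derive a distinct, reproducible seed for each DataLoader worker."""
    return num_workers * rank + worker_id + seed


def epoch_order(num_samples: int, seed: int, epoch: int) -> list:
    """Return a shuffled index order that depends only on (seed, epoch)."""
    rng = random.Random(seed + epoch)  # private RNG, global state untouched
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


# Two runs with the same seed visit the samples in the same order ...
run_a = epoch_order(num_samples=10, seed=0, epoch=1)
run_b = epoch_order(num_samples=10, seed=0, epoch=1)
print(run_a == run_b)

# ... while every worker on every rank still gets its own seed.
seeds = {worker_seed(4, rank, wid, seed=0) for rank in range(4) for wid in range(4)}
print(len(seeds))
```

If the sample order really is identical between runs and the crash still lands on a different batch, that would suggest the failure does not come from one specific sample. A seed fixes the data order, not the state of the host machine.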

Environment:

sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
CuDNN 8.0.5
Magma 2.5.2

TorchVision: 0.9.1
OpenCV: 4.7.0
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.16.0+69d67ff

The error output is as follows:

    2023-03-29 05:03:13,493 - mmdet - INFO - Epoch [1][1860/4004] lr: 3.907e-04, eta: 3 days, 9:03:25, time: 3.341, data_time: 0.491, memory: 18153, positive_bag_loss: 1.4544, negative_bag_loss: 0.1518, loss: 1.6061, grad_norm: 1.5741
    Traceback (most recent call last):
      File "tools/train.py", line 279, in <module>
        main()
      File "tools/train.py", line 268, in main
        train_model(
      File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 184, in train_model
        train_detector(
      File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 159, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
        epoch_runner(data_loaders[i], **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
        self.run_iter(data_batch, train_mode=True, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
        outputs = self.model.train_step(data_batch, self.optimizer,
      File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])
      File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
        losses = self(**data)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
        return old_func(*args, **kwargs)
      File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 294, in forward
        return self.forward_train(img, img_metas, **kwargs)
      File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 301, in forward_train
        feature_bev, valids, features_2d = self.extract_feat(img, img_metas, "train")
      File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 123, in extract_feat
        x = self.backbone(
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 642, in forward
        x = res_layer(x)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
        input = module(input)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 89, in forward
        out = _inner_forward(x)
      File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 72, in _inner_forward
        out = self.conv1(x)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
        return F.conv2d(input, weight, bias, self.stride,
      File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
        _error_if_any_worker_fails()
    RuntimeError: DataLoader worker (pid 6011) is killed by signal: Killed.
    Killing subprocess 740
    Killing subprocess 741
    Killing subprocess 742
    Killing subprocess 743
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tools/train.py', '--local_rank=3', './configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py', '--work-dir=./work_dirs/my/exp/', '--launcher=pytorch', '--gpus', '4']' returned non-zero exit status 1.
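The final RuntimeError says the worker was "killed by signal: Killed", which is how SIGKILL is reported on Linux. In other words, the worker did not raise a Python exception; something outside the process ended it. The innermost `F.conv2d` frame is probably just where the main process happened to be when the signal handler fired, not the cause. The sketch below reproduces that kind of exit in a child process and decodes it. The helper names `describe_exit` and `run_and_describe` are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a return code of -N means "terminated by signal N".
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


def run_and_describe(code: str) -> str:
    """Run a Python snippet in a child process and describe how it ended."""
    result = subprocess.run([sys.executable, "-c", code])
    return describe_exit(result.returncode)


# A child that sends itself SIGKILL, standing in for a worker killed from outside.
self_kill = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
print(run_and_describe(self_kill))

# A child that fails on its own, for comparison.
print(run_and_describe("raise SystemExit(1)"))
```

A process cannot catch SIGKILL, so the worker itself leaves no traceback. Any record of who sent the signal would have to come from outside the training process, for example from the kernel log.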

chr10003566 commented 8 months ago

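Hello, have you found a solution? I ran into the same situation; it usually happens about once per epoch. It looks similar to https://discuss.pytorch.org/t/died-with-signals-sigkill-9-when-in-first-epoch-the-program-is-killed/131704/1, but it is hard to pin down. Something is probably running out of memory.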
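One way to test the out-of-memory guess is to log host RAM while training runs and check whether available memory shrinks steadily before the kill. The sketch below is a minimal, Linux-only example that reads `/proc/meminfo`. The helper names are hypothetical, and the sample text uses made-up numbers purely to show the parsing.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields: Dict[str, int]) -> float:
    """Fraction of host RAM the kernel reports as still available."""
    return fields["MemAvailable"] / fields["MemTotal"]


def read_host_memory(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read the live values from the running system (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


# Made-up sample in the same layout as /proc/meminfo.
sample = (
    "MemTotal:       65536000 kB\n"
    "MemAvailable:    6553600 kB\n"
    "SwapFree:              0 kB\n"
)
print(round(available_fraction(parse_meminfo(sample)), 2))
```

Calling `read_host_memory()` every few hundred iterations and logging `MemAvailable` next to the iteration number would show whether host memory is the resource being exhausted. GPU memory is a separate quantity: the `memory: 18153` field in the training log does not reflect host RAM.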

keys-zlc commented 5 months ago

Has anyone solved this problem?

evercherish commented 1 month ago

Has anyone solved this problem?

Sense-GVT / Fast-BEV

Errors occur during training #44