The error message is as follows:
2023-03-29 05:03:13,493 - mmdet - INFO - Epoch [1][1860/4004] lr: 3.907e-04, eta: 3 days, 9:03:25, time: 3.341, data_time: 0.491, memory: 18153, positive_bag_loss: 1.4544, negative_bag_loss: 0.1518, loss: 1.6061, grad_norm: 1.5741
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 268, in main
    train_model(
  File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 184, in train_model
    train_detector(
  File "/workspace/Fast-BEV/mmdet3d/apis/train.py", line 159, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 294, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 301, in forward_train
    feature_bev, valids, features_2d = self.extract_feat(img, img_metas, "train")
  File "/workspace/Fast-BEV/mmdet3d/models/detectors/fastbev.py", line 123, in extract_feat
    x = self.backbone(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 642, in forward
    x = res_layer(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 89, in forward
    out = _inner_forward(x)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 72, in _inner_forward
    out = self.conv1(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6011) is killed by signal: Killed.
Killing subprocess 740
Killing subprocess 741
Killing subprocess 742
Killing subprocess 743
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tools/train.py', '--local_rank=3', './configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f4.py', '--work-dir=./work_dirs/my/exp/', '--launcher=pytorch', '--gpus', '4']' returned non-zero exit status 1.
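Training cannot complete a single epoch, and the error occurs at a different batch on every run.

Question 1: The seed defaults to 0. Does it have no effect on data loading? If it took effect, the error should occur at the same iteration each time.

Question 2: With both the pkl files provided by the author and the ones I generated myself, the error occurs within the first epoch of training. With the mini dataset, training completes all 20 epochs.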
Environment:
sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.9.1
OpenCV: 4.7.0
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.16.0+69d67ff
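
One possible reading of `killed by signal: Killed` is that the kernel's out-of-memory killer terminated the DataLoader worker. The log does not confirm this, so treat it as an assumption; it would at least be consistent with the crash landing on a different batch each run and only on the full dataset. Below is a minimal sketch, using only the standard library, for printing host memory and shared-memory headroom while training runs. The helper names `parse_meminfo` and `host_memory_report` are hypothetical and not part of Fast-BEV, mmcv, or PyTorch.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}; most fields are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <integer> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def host_memory_report(meminfo_path="/proc/meminfo", shm_path="/dev/shm"):
    """Return host RAM, free swap and free shared memory in GiB (Linux only)."""
    report = {}
    if os.path.exists(meminfo_path):
        with open(meminfo_path) as f:
            info = parse_meminfo(f.read())
        for key in ("MemTotal", "MemAvailable", "SwapFree"):
            if key in info:
                report[key] = info[key] / 1024 ** 2  # kB -> GiB
    if os.path.isdir(shm_path):
        # DataLoader workers pass tensors through shared memory, so a small
        # /dev/shm (common in containers) is worth checking as well.
        report["ShmFree"] = shutil.disk_usage(shm_path).free / 1024 ** 3
    return report


if __name__ == "__main__":
    for name, gib in host_memory_report().items():
        print(f"{name}: {gib:.2f} GiB")
```

If `MemAvailable` or `ShmFree` drops toward zero shortly before the crash, lowering `workers_per_gpu` in the config is one thing to try. Checking `dmesg` on the host for OOM-killer messages would confirm or rule out the assumption.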