fundamentalvision / BEVFormer

[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
https://arxiv.org/abs/2203.17270
Apache License 2.0

torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) pops up local_rank: 0 (pid: 290596) of binary #281

Open xiaohuipoi opened 2 days ago

xiaohuipoi commented 2 days ago

Hello, I have a problem. When I train bevformer_small on the base dataset, the first epoch works fine and the results are saved to the results JSON file. However, once the second epoch finishes training, the following error appears: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary. Is this caused by insufficient video memory, insufficient storage space, or some other issue?

GPU: 1 × 4060 Ti 16 GB. Remaining storage space: 190 GB / 800 GB.

xiaohuipoi commented 2 days ago
2024-10-14 20:45:26,306 - mmdet - INFO - Epoch [2][28100/28130] lr: 1.866e-04, eta: 2 days, 20:24:13, time: 2.191, data_time: 0.016, memory: 7344, loss_cls: 0.5252, loss_bbox: 0.6573, d0.loss_cls: 0.4906, d0.loss_bbox: 0.7357, d1.loss_cls: 0.5090, d1.loss_bbox: 0.6798, d2.loss_cls: 0.5115, d2.loss_bbox: 0.6681, d3.loss_cls: 0.5248, d3.loss_bbox: 0.6641, d4.loss_cls: 0.5259, d4.loss_bbox: 0.6617, loss: 7.1536, grad_norm: 158.5894
2024-10-14 20:46:32,026 - mmdet - INFO - Saving checkpoint at 2 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.6 task/s, elapsed: 2317s, ETA:     0s

Formating bboxes of pts_bbox
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 35.6 task/s, elapsed: 169s, ETA:     0s
Results writes to val/./work_dirs/bevformer_small/Sun_Oct_13_09_48_30_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary: /home/xiaohuipoi/anaconda3/envs/bev/bin/python
Traceback (most recent call last):
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
./tools/train.py FAILED
This is the output from my terminal. Thanks.
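
A note on reading the exit code in the log above: Python's process APIs report a negative exit code -N when the child was terminated by signal N, and the launcher appears to follow that convention. Under that assumption, the -9 can be decoded with the minimal sketch below. The helper name `describe_exit_code` is hypothetical and is not part of PyTorch or BEVFormer.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Hypothetical helper: turn a process exit code into a readable message.

    Assumes the convention used by Python's process APIs, where a negative
    value -N means the process was terminated by signal N.
    """
    if exitcode >= 0:
        return f"exited with status {exitcode}"
    signum = -exitcode
    try:
        # Look up the symbolic name of the signal on this platform.
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not defined on this platform.
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


if __name__ == "__main__":
    # Exit code taken from the log above.
    print(describe_exit_code(-9))
```

On Linux, signal 9 is SIGKILL, which the process cannot catch, so no Python traceback from the training script itself is printed. A SIGKILL at this stage is often, though not necessarily, the kernel's out-of-memory killer reacting to exhausted host RAM rather than GPU memory, so watching system memory while the evaluation loads the NuScenes tables may help narrow this down.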