Open xiaohuipoi opened 2 days ago
2024-10-14 20:45:26,306 - mmdet - INFO - Epoch [2][28100/28130] lr: 1.866e-04, eta: 2 days, 20:24:13, time: 2.191, data_time: 0.016, memory: 7344, loss_cls: 0.5252, loss_bbox: 0.6573, d0.loss_cls: 0.4906, d0.loss_bbox: 0.7357, d1.loss_cls: 0.5090, d1.loss_bbox: 0.6798, d2.loss_cls: 0.5115, d2.loss_bbox: 0.6681, d3.loss_cls: 0.5248, d3.loss_bbox: 0.6641, d4.loss_cls: 0.5259, d4.loss_bbox: 0.6617, loss: 7.1536, grad_norm: 158.5894
2024-10-14 20:46:32,026 - mmdet - INFO - Saving checkpoint at 2 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.6 task/s, elapsed: 2317s, ETA: 0s
Formating bboxes of pts_bbox
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 35.6 task/s, elapsed: 169s, ETA: 0s
Results writes to val/./work_dirs/bevformer_small/Sun_Oct_13_09_48_30_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary: /home/xiaohuipoi/anaconda3/envs/bev/bin/python
Traceback (most recent call last):
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
./tools/train.py FAILED
The above is the output from my terminal. Thanks.
Hello, I have a problem. When I train bevformer_small on the base dataset, the first epoch works fine and saves the results to the results JSON file. After the second epoch of training completes, however, the run fails while loading the NuScenes tables for evaluation, with the message "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary". Is this due to insufficient video memory, insufficient storage space, or some other issue?
GPU: 1 x 4060 Ti 16 GB. Remaining storage space: 190G/800G.
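To help narrow down the question, here is a minimal Python sketch of how the exit code could be read. It relies on the convention that a negative exit code is the number of the signal that killed the process, so -9 would be SIGKILL. A SIGKILL at the point where the NuScenes tables are loaded is often a sign that host RAM (not GPU memory or disk) ran out, but that is only a hypothesis here and would need to be confirmed, for example in the kernel log. The helper names `describe_exit_code` and `parse_meminfo` are made up for illustration and are not part of mmdet or PyTorch.

```python
import os
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable description.

    Negative values are interpreted as "killed by signal N".
    """
    if exitcode < 0:
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}.

    Values are the raw integers from the file (kB for most fields).
    """
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


if __name__ == "__main__":
    print(describe_exit_code(-9))
    # Linux only: report how much host RAM is currently available.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        if "MemAvailable" in mem:
            print(f"MemAvailable: {mem['MemAvailable'] / 1024 ** 2:.1f} GiB")
```

Running this shortly before the evaluation step starts would show how much host memory is free at that point. If the available amount is low, host RAM becomes a more likely suspect than the 16 GB of video memory or the remaining storage space.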