Vegeta2020 / SE-SSD

SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud, CVPR 2021.
Apache License 2.0

RuntimeError: Dataloader worker (pid(s) 19435) exited unexpectedly #61

Closed: Carl12138aka closed this issue 2 years ago

Carl12138aka commented 2 years ago

```
2021-11-24 05:56:46,311 - INFO - Epoch [31/60][210/928] lr: 0.00278, eta: 11:25:38, time: 1.490, data_time: 0.018, transfer_time: 0.017, forward_time: 0.670, loss_parse_time: 0.000 memory: 3703, 2021-11-24 05:56:46,311 - INFO - task : ['Car'], loss: 0.8918, cls_loss_reduced: 0.1903, loc_loss_reduced: 0.2803, dir_loss_reduced: 0.0349, iou_pred_loss: 0.0804, consistency_loss: 0.0556, loc_loss_elem: ['0.0058', '0.0053', '0.0284', '0.0174', '0.0239', '0.0216', '0.0377'], cls_pos_loss: 0.1345, cls_neg_loss: 0.0558, ious_loss: 0.5306, num_pos: 72.9000, num_neg: 70218.3000, loss_ema: 0.2507, cls_loss_reduced_ema: 0.1573, loc_loss_reduced_ema: 0.2217, dir_loss_reduced_ema: 0.0244, iou_pred_loss_ema: 0.0690, loc_loss_elem_ema: ['0.0065', '0.0033', '0.0201', '0.0152', '0.0205', '0.0194', '0.0259'], cls_pos_loss_ema: 0.1030, cls_neg_loss_ema: 0.0544, num_pos_ema: 73.0000, num_neg_ema: 70216.3000

2021-11-24 05:57:01,158 - INFO - Epoch [31/60][220/928] lr: 0.00278, eta: 11:25:24, time: 1.485, data_time: 0.018, transfer_time: 0.017, forward_time: 0.681, loss_parse_time: 0.000 memory: 3703, 2021-11-24 05:57:01,158 - INFO - task : ['Car'], loss: 0.9807, cls_loss_reduced: 0.2094, loc_loss_reduced: 0.3321, dir_loss_reduced: 0.0484, iou_pred_loss: 0.0874, consistency_loss: 0.0610, loc_loss_elem: ['0.0071', '0.0057', '0.0318', '0.0202', '0.0308', '0.0267', '0.0437'], cls_pos_loss: 0.1559, cls_neg_loss: 0.0535, ious_loss: 0.5745, num_pos: 72.0000, num_neg: 70218.4000, loss_ema: 0.2532, cls_loss_reduced_ema: 0.1520, loc_loss_reduced_ema: 0.2008, dir_loss_reduced_ema: 0.0270, iou_pred_loss_ema: 0.0741, loc_loss_elem_ema: ['0.0059', '0.0033', '0.0217', '0.0140', '0.0191', '0.0182', '0.0182'], cls_pos_loss_ema: 0.0948, cls_neg_loss_ema: 0.0572, num_pos_ema: 77.2000, num_neg_ema: 70205.4000

2021-11-24 05:57:15,839 - INFO - Epoch [31/60][230/928] lr: 0.00278, eta: 11:25:08, time: 1.468, data_time: 0.020, transfer_time: 0.017, forward_time: 0.656, loss_parse_time: 0.000 memory: 3703, 2021-11-24 05:57:15,840 - INFO - task : ['Car'], loss: 0.8810, cls_loss_reduced: 0.1796, loc_loss_reduced: 0.2764, dir_loss_reduced: 0.0373, iou_pred_loss: 0.0820, consistency_loss: 0.0527, loc_loss_elem: ['0.0066', '0.0043', '0.0249', '0.0166', '0.0255', '0.0267', '0.0336'], cls_pos_loss: 0.1226, cls_neg_loss: 0.0569, ious_loss: 0.5294, num_pos: 69.8000, num_neg: 70222.8000, loss_ema: 0.2269, cls_loss_reduced_ema: 0.1348, loc_loss_reduced_ema: 0.1698, dir_loss_reduced_ema: 0.0238, iou_pred_loss_ema: 0.0682, loc_loss_elem_ema: ['0.0053', '0.0025', '0.0139', '0.0121', '0.0186', '0.0193', '0.0132'], cls_pos_loss_ema: 0.0787, cls_neg_loss_ema: 0.0561, num_pos_ema: 72.3000, num_neg_ema: 70217.5000

2021-11-24 05:57:30,428 - INFO - Epoch [31/60][240/928] lr: 0.00278, eta: 11:24:53, time: 1.459, data_time: 0.018, transfer_time: 0.017, forward_time: 0.661, loss_parse_time: 0.000 memory: 3703, 2021-11-24 05:57:30,428 - INFO - task : ['Car'], loss: 0.9554, cls_loss_reduced: 0.2016, loc_loss_reduced: 0.3169, dir_loss_reduced: 0.0402, iou_pred_loss: 0.0874, consistency_loss: 0.0569, loc_loss_elem: ['0.0068', '0.0049', '0.0294', '0.0192', '0.0280', '0.0259', '0.0441'], cls_pos_loss: 0.1453, cls_neg_loss: 0.0564, ious_loss: 0.5693, num_pos: 73.5000, num_neg: 70216.1000, loss_ema: 0.2422, cls_loss_reduced_ema: 0.1535, loc_loss_reduced_ema: 0.1874, dir_loss_reduced_ema: 0.0226, iou_pred_loss_ema: 0.0660, loc_loss_elem_ema: ['0.0054', '0.0029', '0.0182', '0.0127', '0.0205', '0.0151', '0.0190'], cls_pos_loss_ema: 0.0925, cls_neg_loss_ema: 0.0610, num_pos_ema: 75.9000, num_neg_ema: 70212.4000

Traceback (most recent call last):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/multiprocessing/connection.py", line 913, in wait
    with _WaitSelector() as selector:
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19435) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 118, in <module>
    main()
  File "train.py", line 115, in main
    train_detector(model, datasets, cfg, distributed=distributed, validate=args.validate, logger=logger,)
  File "/home/miao/Music/SE-SSD/det3d/torchie/apis/train_sessd.py", line 323, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/home/miao/Music/SE-SSD/det3d/torchie/trainer/trainer_sessd.py", line 472, in run
    epoch_runner(data_loaders[0], data_loaders[1], self.epoch, **kwargs)
  File "/home/miao/Music/SE-SSD/det3d/torchie/trainer/trainer_sessd.py", line 333, in train
    for i, data_batch in enumerate(data_loader):
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/miao/anaconda3/envs/sessd/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 19435) exited unexpectedly
```

Does anybody know what happened? How can I train correctly?

Carl12138aka commented 2 years ago

I solved this problem by changing `workers_per_gpu` from 4 to 2.
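For context: the inner error, `killed by signal: Killed`, usually means the operating system's OOM killer terminated one of the DataLoader worker processes because the machine ran out of host RAM. Each worker is a separate process running its own copy of the data pipeline, so fewer workers means less memory pressure. Below is a minimal sketch of the equivalent knob in plain PyTorch (`num_workers` corresponds to the `workers_per_gpu` key in the SE-SSD config; the dataset class here is a hypothetical stand-in, not the repo's KITTI dataset):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyPointCloudDataset(Dataset):
    """Illustrative stand-in for the KITTI point-cloud dataset used by SE-SSD."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Each sample is a small random "point cloud" of shape (N, 4): x, y, z, intensity.
        return torch.randn(100, 4)


# Each worker process holds its own copy of the loading pipeline, so lowering
# num_workers (workers_per_gpu in the SE-SSD config) reduces host-RAM usage
# and helps avoid the OOM killer reaping a worker mid-epoch.
loader = DataLoader(DummyPointCloudDataset(), batch_size=4, num_workers=2)

for batch in loader:
    print(batch.shape)  # torch.Size([4, 100, 4])
```

If lowering the worker count is not enough, reducing `batch_size` or the size of the augmentation database cached in memory are other common ways to bring host-RAM usage down; setting `num_workers=0` (single-process loading) is a useful way to confirm the workers are the cause.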