fundamentalvision / BEVFormer

[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
https://arxiv.org/abs/2203.17270
Apache License 2.0

DataLoader worker is killed by signal: Killed #55

Closed: Bosszhe closed this issue 2 years ago

Bosszhe commented 2 years ago

While training bevformer-base with batch_size=2, I hit the following error:

Traceback (most recent call last):
  File "./tools/train.py", line 259, in <module>
    main()
  File "./tools/train.py", line 248, in main
    custom_train_model(
  File "/home/JJ_Group/wangz/wangzhe21/BEVFormer_wzh/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
    custom_train_detector(
  File "/home/JJ_Group/wangz/wangzhe21/BEVFormer_wzh/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 199, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 450887) is killed by signal: Killed. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2532319) of binary: /home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/bin/python

Training ran normally through the first epoch and produced the evaluation results below.

2022-07-03 17:19:56,577 - mmdet - INFO - Epoch [1][6850/7033]   lr: 2.000e-04, eta: 18 days, 2:20:33, time: 9.657, data_time: 2.484, memory: 49347, loss_cls: 0.3669, loss_bbox: 0.6684, d0.loss_cls: 0.3542, d0.loss_bbox: 0.7710, d1.loss_cls: 0.3529, d1.loss_bbox: 0.6866, d2.loss_cls: 0.3549, d2.loss_bbox: 0.6724, d3.loss_cls: 0.3598, d3.loss_bbox: 0.6688, d4.loss_cls: 0.3602, d4.loss_bbox: 0.6663, loss: 6.2823, grad_norm: 50.7038
2022-07-03 17:28:38,627 - mmdet - INFO - Epoch [1][6900/7033]   lr: 2.000e-04, eta: 18 days, 2:27:51, time: 10.440, data_time: 2.962, memory: 49347, loss_cls: 0.3638, loss_bbox: 0.6734, d0.loss_cls: 0.3585, d0.loss_bbox: 0.7751, d1.loss_cls: 0.3556, d1.loss_bbox: 0.6912, d2.loss_cls: 0.3550, d2.loss_bbox: 0.6796, d3.loss_cls: 0.3580, d3.loss_bbox: 0.6759, d4.loss_cls: 0.3581, d4.loss_bbox: 0.6746, loss: 6.3187, grad_norm: 45.8548
2022-07-03 17:36:48,350 - mmdet - INFO - Epoch [1][6950/7033]   lr: 2.000e-04, eta: 18 days, 2:22:22, time: 9.794, data_time: 2.617, memory: 49347, loss_cls: 0.3452, loss_bbox: 0.6631, d0.loss_cls: 0.3387, d0.loss_bbox: 0.7695, d1.loss_cls: 0.3399, d1.loss_bbox: 0.6722, d2.loss_cls: 0.3386, d2.loss_bbox: 0.6633, d3.loss_cls: 0.3408, d3.loss_bbox: 0.6579, d4.loss_cls: 0.3407, d4.loss_bbox: 0.6581, loss: 6.1281, grad_norm: 48.2113
2022-07-03 17:45:12,259 - mmdet - INFO - Exp name: bevformer_base.py
2022-07-03 17:45:12,261 - mmdet - INFO - Epoch [1][7000/7033]   lr: 2.000e-04, eta: 18 days, 2:22:21, time: 10.079, data_time: 2.743, memory: 49347, loss_cls: 0.3570, loss_bbox: 0.6747, d0.loss_cls: 0.3517, d0.loss_bbox: 0.7780, d1.loss_cls: 0.3466, d1.loss_bbox: 0.6869, d2.loss_cls: 0.3441, d2.loss_bbox: 0.6775, d3.loss_cls: 0.3488, d3.loss_bbox: 0.6760, d4.loss_cls: 0.3510, d4.loss_bbox: 0.6730, loss: 6.2653, grad_norm: 54.0465
2022-07-03 17:50:16,340 - mmdet - INFO - Saving checkpoint at 1 epochs                                                                                                                                                                                           
[                                                  ] 0/6019, elapsed: 0s, ETA:/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/home/JJ_Group/wangz/wangzhe21/BEVFormer_wzh/projects/mmdet3d_plugin/core/bbox/coders/nms_free_coder.py:76: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
/home/JJ_Group/wangz/.conda/envs/mmdet3d_v0.17.1/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
[                                                  ] 2/6019, 0.1 task/s, elapsed: 16s, ETA: 47074s/home/JJ_Group/wangz/wangzhe21/BEVFormer_wzh/projects/mmdet3d_plugin/core/bbox/coders/nms_free_coder.py:76: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6020/6019, 3.0 task/s, elapsed: 2031s, ETA:     0s                                                                                                                                                          

Formating bboxes of pts_bbox                                                                                                                                                                                                                                     
Start to convert detection format...                                                                                                                                                                                                                             
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 8.9 task/s, elapsed: 680s, ETA:     0s                                                                                                                                                           
Results writes to val/./work_dirs/bevformer_base_bs_4/Sat_Jul__2_22_57_29_2022/pts_bbox/results_nusc.json                                                                                                                                                        
Evaluating bboxes of pts_bbox                                                                                                                                                                                                                                    
======                                                                                                                                                                                                                                                           
Loading NuScenes tables for version v1.0-trainval...                                                                                                                                                                                                             
23 category,                                                                                                                                                                                                                                                     
8 attribute,                                                                                                                                                                                                                                                     
4 visibility,                                                                                                                                                                                                                                                    
64386 instance,                                                                                                                                                                                                                                                  
12 sensor,                                                                                                                                                                                                                                                       
10200 calibrated_sensor,                                                                                                                                                                                                                                         
2631083 ego_pose,                                                                                                                                                                                                                                                
68 log,                                                                                                                                                                                                                                                          
850 scene,                                                                                                                                                                                                                                                       
34149 sample,                                                                                                                                                                                                                                                    
2631083 sample_data,                                                                                                                                                                                                                                             
1166187 sample_annotation,                                                                                                                                                                                                                                       
4 map,                                                                                                                                                                                                                                                           
Done loading in 165.923 seconds.                                                                                                                                                                                                                                 
======                                                                                                                                                                                                                                                           
Reverse indexing ...                                                                                                                                                                                                                                             
Done reverse indexing in 25.8 seconds.                                                                                          
======                                                                                                                          
Initializing nuScenes detection evaluation                                                                                      
Loaded results from val/./work_dirs/bevformer_base_bs_4/Sat_Jul__2_22_57_29_2022/pts_bbox/results_nusc.json. Found detections for 6019 samples.
Loading annotations for val split from nuScenes version: v1.0-trainval 

 [00:17<00:00, 338.76it/s]                                                                                                                                                                                                                                       
Loaded ground truth annotations for 6019 samples.                                                                                                                                                                                                                
Filtering predictions                                                                                                                                                                                                                                            
=> Original number of boxes: 1368478                                                                                                                                                                                                                             
=> After distance based filtering: 1368005                                                                                                                                                                                                                       
=> After LIDAR and RADAR points based filtering: 1368005                                                                                                                                                                                                         
=> After bike rack filtering: 1367610                                                                                                                                                                                                                            
Filtering ground truth annotations                                                                                                                                                                                                                               
=> Original number of boxes: 187528                                                                                                                                                                                                                              
=> After distance based filtering: 134565                                                                                                                                                                                                                        
=> After LIDAR and RADAR points based filtering: 121871                                                                                                                                                                                                          
=> After bike rack filtering: 121861                                                                                                                                                                                                                             
Accumulating metric data...                                                                                                                                                                                                                                      
Calculating metrics...                                                                                                                                                                                                                                           
Saving metrics to: val/./work_dirs/bevformer_base_bs_4/Sat_Jul__2_22_57_29_2022/pts_bbox                                                                                                                                                                         
mAP: 0.2559                                                                                                                                                                                                                                                      
mATE: 0.8938                                                                                                                                                                                                                                                     
mASE: 0.3263                                                                                                                                                                                                                                                     
mAOE: 0.6558                                                                                                                                                                                                                                                     
mAVE: 1.1642                                                                                                                                                                                                                                                     
mAAE: 0.3630                                                                                                                                                                                                                                                     
NDS: 0.3041                                                                                                                                                                                                                                                      
Eval time: 538.7s                                                                                                                                                                                                                                                

Per-class results:                                                                                                              
Object Class            AP      ATE     ASE     AOE     AVE     AAE
car                     0.436   0.671   0.176   0.149   1.922   0.458
truck                   0.214   0.828   0.255   0.226   1.109   0.346
bus                     0.251   0.914   0.286   0.255   2.448   0.644
trailer                 0.064   1.255   0.350   0.982   0.645   0.165
construction_vehicle    0.059   1.085   0.510   1.329   0.124   0.332
pedestrian              0.346   0.866   0.328   0.735   0.765   0.398
motorcycle              0.248   0.879   0.344   0.817   1.602   0.379
bicycle                 0.213   0.919   0.319   1.120   0.698   0.181
traffic_cone            0.386   0.728   0.380   nan     nan     nan
barrier                 0.343   0.793   0.314   0.290   nan     nan
2022-07-03 18:53:57,216 - mmdet - INFO - Exp name: bevformer_base.py

zhiqi-li commented 2 years ago

Sorry, it seems the error was caused by external factors rather than by the code itself. If you resume training, does the problem still occur?
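
One external factor worth ruling out: a DataLoader worker that dies with `signal: Killed` received SIGKILL, which on Linux often (though not always) comes from the kernel's out-of-memory killer when host RAM runs low. Below is a minimal sketch, assuming a Linux host, for logging how much host memory is available before resuming. `parse_meminfo` and `available_gib` are hypothetical helper names for illustration; they are not part of BEVFormer, mmcv, or PyTorch.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value}.

    Most fields are reported in kB; plain counters (e.g. HugePages_Total)
    are kept as bare integers.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text: str) -> float:
    """Return MemAvailable in GiB, falling back to MemFree if it is absent."""
    info = parse_meminfo(text)
    kib = info.get("MemAvailable", info.get("MemFree", 0))
    return kib / (1024 ** 2)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # exists on Linux only
    if meminfo.exists():
        gib = available_gib(meminfo.read_text())
        print(f"host memory available: {gib:.1f} GiB")
```

If available host memory keeps shrinking while training or evaluation runs, lowering `workers_per_gpu` in the config's `data` dict is a usual first thing to try. This is a general suggestion, not a diagnosis confirmed in this thread.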