THU-DA-6D-Pose-Group / GDR-Net

GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. (CVPR 2021)
https://github.com/THU-DA-6D-Pose-Group/GDR-Net
Apache License 2.0

ERROR: Unexpected segmentation fault encountered in worker. #47

Closed · tonye0115 closed this issue 2 years ago

tonye0115 commented 3 years ago

################### train_dset_names: ('lm_13_train', 'lm_imgn_13_train_1k_per_obj')
[1029_163303 core.gdrn_modeling.datasets.lm_dataset_d2@92]: load cached dataset dicts from /root/workspace/GDR-Net/.cache/dataset_dicts_lm_13_train_692bc90e98013deeafa8a88ef2dc5d9d.pkl
[1029_163303 core.gdrn_modeling.datasets.lm_syn_imgn@96]: load cached dataset dicts from /root/workspace/GDR-Net/datasets/lm_imgn/dataset_dicts_lm_imgn_13_train_1k_per_obj_c445f2076027c576dd480cb1776d7c7e.pkl
[10/29 16:33:03 d2.data.build]: Removed 0 images with no usable annotations. 15375 images left.
[10/29 16:33:03 d2.data.build]: Distribution of instances among all 13 categories:

[1029_163303 core.gdrn_modeling.data_loader@116]: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(480,), max_size=640, sample_style='choice')]
[1029_163304 core.gdrn_modeling.data_loader@176]: Serializing 15375 elements to byte tensors and concatenating them all ...
[1029_163304 core.gdrn_modeling.data_loader@181]: Serialized dataset takes 26.39 MiB
[1029_163304 core.gdrn_modeling.data_loader@688]: Using training sampler TrainingSampler
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
[1029_163304@core/gdrn_modeling/engine.py:186] DBG images_per_batch: 24
[1029_163304@core/gdrn_modeling/engine.py:187] DBG dataset length: 15375
[1029_163304@core/gdrn_modeling/engine.py:188] DBG iters per epoch: 640
[1029_163304@core/gdrn_modeling/engine.py:189] DBG total iters: 102400
[1029_163304 core.gdrn_modeling.engine@193]: AMP enabled: False
[10/29 16:33:04 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[1029_163304 core.gdrn_modeling.engine@242]: Starting training from iteration 0
############## 0.0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 987, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3258) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "core/gdrn_modeling/main_gdrn.py", line 153, in launch( File "/opt/detectron2/detectron2/engine/launch.py", line 82, in launch main_func(*args) File "core/gdrn_modeling/main_gdrn.py", line 122, in main do_train(cfg, args, model, optimizer, resume=args.resume) File "/root/workspace/GDR-Net/core/gdrn_modeling/../../core/gdrn_modeling/engine.py", line 254, in do_train data = next(data_loader_iter) File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 518, in next data = self._next_data() File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data idx, data = self._get_data() File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1149, in _get_data success, data = self._try_get_data() File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1000, in _try_get_data raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e RuntimeError: DataLoader worker (pid(s) 3258) exited unexpectedly


The command-line arguments were:

Command line arguments: Namespace(config_file='/root/workspace/GDR-Net/configs/gdrn/lm/a6_cPnP_lm13.py', dist_url='tcp://10.250.11.64:1234', eval_only=False, fp16_allreduce=False, machine_rank=0, num_gpus=1, num_machines=1, opts=None, resume=False, use_adasum=False, use_hvd=False)
[10/29 16:33:00 detectron2]: Contents of args.config_file=/root/workspace/GDR-Net/configs/gdrn/lm/a6_cPnP_lm13.py:
_base_ = ["../../_base_/gdrn_base.py"]
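A common first step for this kind of failure is to rebuild the loader with num_workers=0, so that data loading runs in the main process and the underlying error surfaces instead of the generic worker-segfault message. Below is a minimal standalone sketch of that idea; DummyDataset is a hypothetical stand-in, not the GDR-Net dataset:

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    # Stand-in for the real training dataset; swap in the actual one.
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return torch.zeros(3, 480, 640)

# num_workers=0 keeps loading in the main process. If the worker processes
# are the problem (too many workers, shared-memory limits, a native-library
# bug), this run either succeeds or shows the real Python traceback.
loader = DataLoader(DummyDataset(), batch_size=4, num_workers=0)
batch = next(iter(loader))
print(batch.shape)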

wangg12 commented 3 years ago

What is your environment (system version, GPU, CUDA version, PyTorch version, etc.)?

tonye0115 commented 3 years ago

This error has been resolved:

It was caused by having only one GPU while the data loader was configured with NUM_WORKERS=4, which can be seen from the error log below.

[1029_163304 core.gdrn_modeling.data_loader@181]: Serialized dataset takes 26.39 MiB
[1029_163304 core.gdrn_modeling.data_loader@688]: Using training sampler TrainingSampler
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.

Solution: set NUM_WORKERS=1 in the config https://github.com/THU-DA-6D-Pose-Group/GDR-Net/blob/main/configs/_base_/common_base.py, as shown below.

# -----------------------------------------------------------------------------
# DataLoader
# -----------------------------------------------------------------------------
DATALOADER = dict(
    # Number of data loading threads
    NUM_WORKERS=1,
)
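For context, this is roughly how a setting like cfg.DATALOADER.NUM_WORKERS reaches PyTorch. The snippet below is a simplified, hypothetical builder, not the actual GDR-Net data_loader code:

from torch.utils.data import DataLoader

def build_train_loader(dataset, num_workers, images_per_batch):
    # Each worker is a separate process. On a single GPU with limited shared
    # memory, a small value (1, or 0 to disable workers) is often safer.
    return DataLoader(
        dataset,
        batch_size=images_per_batch,
        shuffle=True,
        num_workers=num_workers,
    )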

Below is the environment information of the machine used for training:


sys.platform            linux
Python                  3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
numpy                   1.21.3
detectron2              0.5 @/opt/detectron2/detectron2
Compiler                GCC 9.3
CUDA compiler           not available
DETECTRON2_ENV_MODULE
PyTorch                 1.9.0a0+df837d0 @/opt/conda/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA GeForce RTX 2070 Super (arch=7.5)
Driver version          470.57.02
CUDA_HOME               /usr/local/cuda
TORCH_CUDA_ARCH_LIST    5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX
Pillow                  7.0.0.post3
torchvision             0.9.0a0 @/opt/conda/lib/python3.8/site-packages/torchvision
torchvision arch flags  5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20210415
iopath                  0.1.8
cv2                     4.5.4-dev


Below is the training log:

[1101_111538 core.utils.my_writer@171]: eta: 2 days, 6:16:59 iter: 98/102400[0.1%] time: 1.9102 lr: 9.8902e-06 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2809e+01 (2.3269e+01) loss_coor_x: 4.2536e-01 (4.7196e-01) loss_coor_y: 4.1887e-01 (4.6651e-01) loss_coor_z: 4.8287e-01 (5.1969e-01) loss_mask: 2.3128e-01 (2.3072e-01) loss_region: 1.8391e+01 (1.8628e+01) loss_PM_R: 5.6937e-01 (5.9992e-01) loss_centroid: 1.3477e-01 (1.3292e-01) loss_z: 2.2039e+00 (2.2200e+00)

[1101_111540 core.utils.my_writer@171]: eta: 2 days, 6:21:31 iter: 99/102400[0.1%] time: 1.9129 lr: 9.9901e-06 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2894e+01 (2.3289e+01) loss_coor_x: 4.2244e-01 (4.7130e-01) loss_coor_y: 4.2282e-01 (4.6632e-01) loss_coor_z: 4.8287e-01 (5.1886e-01) loss_mask: 2.3056e-01 (2.3048e-01) loss_region: 1.8432e+01 (1.8647e+01) loss_PM_R: 5.7124e-01 (6.0001e-01) loss_centroid: 1.3504e-01 (1.3299e-01) loss_z: 2.2039e+00 (2.2218e+00)

[1101_111844 core.utils.my_writer@171]: eta: 2 days, 5:48:08 iter: 199/102400[0.2%] time: 1.8952 lr: 1.9980e-05 max_mem: 3611M total_grad_norm: N/A total_loss: 2.3106e+01 (2.3154e+01) loss_coor_x: 3.6894e-01 (4.3536e-01) loss_coor_y: 3.6274e-01 (4.2414e-01) loss_coor_z: 3.9226e-01 (4.7548e-01) loss_mask: 2.2176e-01 (2.2896e-01) loss_region: 1.8975e+01 (1.8656e+01) loss_PM_R: 5.7044e-01 (5.8605e-01) loss_centroid: 1.3406e-01 (1.3313e-01) loss_z: 2.2203e+00 (2.2155e+00)

[1101_112149 core.utils.my_writer@171]: eta: 2 days, 4:40:08 iter: 299/102400[0.3%] time: 1.8571 lr: 2.9970e-05 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2194e+01 (2.3024e+01) loss_coor_x: 2.8932e-01 (3.9977e-01) loss_coor_y: 2.8443e-01 (3.8773e-01) loss_coor_z: 3.0530e-01 (4.3300e-01) loss_mask: 2.2744e-01 (2.2783e-01) loss_region: 1.8227e+01 (1.8641e+01) loss_PM_R: 5.5547e-01 (5.7915e-01) loss_centroid: 1.3135e-01 (1.3336e-01) loss_z: 2.2389e+00 (2.2225e+00)

It shows that training will take about 2 days.
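The roughly 2-day estimate is consistent with the per-iteration time in the log; a quick sanity check:

# Back-of-the-envelope check of the ETA reported above.
total_iters = 102400      # "total iters" printed by engine.py
sec_per_iter = 1.86       # approximate time per iteration around iter 299
remaining_s = (total_iters - 299) * sec_per_iter
print(f"{remaining_s / 3600:.1f} h = {remaining_s / 86400:.1f} days")
# -> about 52.8 h, i.e. 2.2 days, in line with "eta: 2 days, 4:40:08"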

wangg12 commented 3 years ago

I remember this being a PyTorch-related bug; you can try changing your PyTorch version.
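If you do switch versions, it helps to record the current build first. A small sketch, assuming a standard PyTorch install with torchvision:

import torch
import torchvision

# Record the versions in use before upgrading or downgrading.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())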