Closed: tonye0115 closed this issue 2 years ago
What is your environment (system version, gpu version, cuda version, pytorch version, etc.)?
This error has been resolved:
The machine only has 1 GPU card, but the data loader was configured with NUM_WORKERS=4, which caused the crash; this can be seen from the error log below.
[1029_163304 core.gdrn_modeling.data_loader@181]: Serialized dataset takes 26.39 MiB
[1029_163304 core.gdrn_modeling.data_loader@688]: Using training sampler TrainingSampler
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Fix: in the config https://github.com/THU-DA-6D-Pose-Group/GDR-Net/blob/main/configs/_base_/common_base.py, set NUM_WORKERS=1:
DATALOADER = dict(
    NUM_WORKERS=1,  # was 4; use 1 on this single-GPU machine
    # ... other DATALOADER options left unchanged
)
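For reference, here is a minimal sketch, independent of the GDR-Net code and using a toy TensorDataset as a stand-in for the real training data, to check whether DataLoader workers crash in a given environment at a given worker count:

```python
# Minimal DataLoader worker smoke test (sketch; toy data, not the GDR-Net dataset).
import torch
from torch.utils.data import DataLoader, TensorDataset

def check_workers(num_workers: int) -> None:
    # Toy dataset standing in for the real training data.
    dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                            torch.zeros(256, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=24, shuffle=True, num_workers=num_workers)
    for images, labels in loader:
        pass  # a clean full pass means no "Unexpected segmentation fault" from the workers
    print(f"num_workers={num_workers}: OK")

if __name__ == "__main__":
    check_workers(1)  # the value recommended above
    check_workers(4)  # the value that crashed on this machine
```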
Below is the environment information of the machine the training was run on:
sys.platform linux
Python 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
numpy 1.21.3
detectron2 0.5 @/opt/detectron2/detectron2
Compiler GCC 9.3
CUDA compiler not available
DETECTRON2_ENV_MODULE
Below is the training log:
[1101_111538 core.utils.my_writer@171]: eta: 2 days, 6:16:59 iter: 98/102400[0.1%] time: 1.9102 lr: 9.8902e-06 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2809e+01 (2.3269e+01) loss_coor_x: 4.2536e-01 (4.7196e-01) loss_coor_y: 4.1887e-01 (4.6651e-01) loss_coor_z: 4.8287e-01 (5.1969e-01) loss_mask: 2.3128e-01 (2.3072e-01) loss_region: 1.8391e+01 (1.8628e+01) loss_PM_R: 5.6937e-01 (5.9992e-01) loss_centroid: 1.3477e-01 (1.3292e-01) loss_z: 2.2039e+00 (2.2200e+00)
[1101_111540 core.utils.my_writer@171]: eta: 2 days, 6:21:31 iter: 99/102400[0.1%] time: 1.9129 lr: 9.9901e-06 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2894e+01 (2.3289e+01) loss_coor_x: 4.2244e-01 (4.7130e-01) loss_coor_y: 4.2282e-01 (4.6632e-01) loss_coor_z: 4.8287e-01 (5.1886e-01) loss_mask: 2.3056e-01 (2.3048e-01) loss_region: 1.8432e+01 (1.8647e+01) loss_PM_R: 5.7124e-01 (6.0001e-01) loss_centroid: 1.3504e-01 (1.3299e-01) loss_z: 2.2039e+00 (2.2218e+00)
[1101_111844 core.utils.my_writer@171]: eta: 2 days, 5:48:08 iter: 199/102400[0.2%] time: 1.8952 lr: 1.9980e-05 max_mem: 3611M total_grad_norm: N/A total_loss: 2.3106e+01 (2.3154e+01) loss_coor_x: 3.6894e-01 (4.3536e-01) loss_coor_y: 3.6274e-01 (4.2414e-01) loss_coor_z: 3.9226e-01 (4.7548e-01) loss_mask: 2.2176e-01 (2.2896e-01) loss_region: 1.8975e+01 (1.8656e+01) loss_PM_R: 5.7044e-01 (5.8605e-01) loss_centroid: 1.3406e-01 (1.3313e-01) loss_z: 2.2203e+00 (2.2155e+00)
[1101_112149 core.utils.my_writer@171]: eta: 2 days, 4:40:08 iter: 299/102400[0.3%] time: 1.8571 lr: 2.9970e-05 max_mem: 3611M total_grad_norm: N/A total_loss: 2.2194e+01 (2.3024e+01) loss_coor_x: 2.8932e-01 (3.9977e-01) loss_coor_y: 2.8443e-01 (3.8773e-01) loss_coor_z: 3.0530e-01 (4.3300e-01) loss_mask: 2.2744e-01 (2.2783e-01) loss_region: 1.8227e+01 (1.8641e+01) loss_PM_R: 5.5547e-01 (5.7915e-01) loss_centroid: 1.3135e-01 (1.3336e-01) loss_z: 2.2389e+00 (2.2225e+00)
It shows that training will take about 2 days.
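The ~2-day figure follows directly from the numbers in the log; a back-of-the-envelope check (values taken from the log lines above):

```python
# Rough ETA check using the per-iteration time and total iteration count from the log.
seconds_per_iter = 1.91        # "time: 1.9102"
total_iters = 102400           # "iter: .../102400"
remaining_days = seconds_per_iter * total_iters / 86400
print(f"{remaining_days:.2f} days")  # ~2.26 days, matching "eta: 2 days, 6:16:59"
```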
I remember this is a PyTorch-related bug; you can try switching to a different PyTorch version.
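Before switching, it is worth confirming which PyTorch build is installed (the exact version is not shown in the environment dump above); a quick check:

```python
# Print the installed PyTorch build info before deciding which version to switch to.
import torch

print(torch.__version__)          # PyTorch version string
print(torch.version.cuda)         # CUDA version the wheel was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # whether a GPU is visible to this build
```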
################### train_dset_names: ('lm_13_train', 'lm_imgn_13_train_1k_per_obj')
[1029_163303 core.gdrn_modeling.datasets.lm_dataset_d2@92]: load cached dataset dicts from /root/workspace/GDR-Net/.cache/dataset_dicts_lm_13_train_692bc90e98013deeafa8a88ef2dc5d9d.pkl
[1029_163303 core.gdrn_modeling.datasets.lm_syn_imgn@96]: load cached dataset dicts from /root/workspace/GDR-Net/datasets/lm_imgn/dataset_dicts_lm_imgn_13_train_1k_per_obj_c445f2076027c576dd480cb1776d7c7e.pkl
[10/29 16:33:03 d2.data.build]: Removed 0 images with no usable annotations. 15375 images left.
[10/29 16:33:03 d2.data.build]: Distribution of instances among all 13 categories:
[1029_163303 core.gdrn_modeling.data_loader@116]: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(480,), max_size=640, sample_style='choice')]
[1029_163304 core.gdrn_modeling.data_loader@176]: Serializing 15375 elements to byte tensors and concatenating them all ...
[1029_163304 core.gdrn_modeling.data_loader@181]: Serialized dataset takes 26.39 MiB
[1029_163304 core.gdrn_modeling.data_loader@688]: Using training sampler TrainingSampler
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
[1029_163304@core/gdrn_modeling/engine.py:186] DBG images_per_batch: 24
[1029_163304@core/gdrn_modeling/engine.py:187] DBG dataset length: 15375
[1029_163304@core/gdrn_modeling/engine.py:188] DBG iters per epoch: 640
[1029_163304@core/gdrn_modeling/engine.py:189] DBG total iters: 102400
[1029_163304 core.gdrn_modeling.engine@193]: AMP enabled: False
[10/29 16:33:04 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[1029_163304 core.gdrn_modeling.engine@242]: Starting training from iteration 0
############## 0.0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 987, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3258) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "core/gdrn_modeling/main_gdrn.py", line 153, in <module>
    launch(
  File "/opt/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "core/gdrn_modeling/main_gdrn.py", line 122, in main
    do_train(cfg, args, model, optimizer, resume=args.resume)
  File "/root/workspace/GDR-Net/core/gdrn_modeling/../../core/gdrn_modeling/engine.py", line 254, in do_train
    data = next(data_loader_iter)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 518, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1149, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1000, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3258) exited unexpectedly
The run arguments were:
Command line arguments: Namespace(config_file='/root/workspace/GDR-Net/configs/gdrn/lm/a6_cPnP_lm13.py', dist_url='tcp://10.250.11.64:1234', eval_only=False, fp16_allreduce=False, machine_rank=0, num_gpus=1, num_machines=1, opts=None, resume=False, use_adasum=False, use_hvd=False)
[10/29 16:33:00 detectron2]: Contents of args.config_file=/root/workspace/GDR-Net/configs/gdrn/lm/a6_cPnP_lm13.py:
_base_ = ["../../_base_/gdrn_base.py"]