facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

PointRend training error "ERROR: Unexpected bus error encountered in worker." #812

Closed: sunnyheart008205 closed this issue 4 years ago

sunnyheart008205 commented 4 years ago
  1. I used the following command to train PointRend on the COCO 2017 dataset, and got errors:

    python ./projects/PointRend/train_net.py --config-file ./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpus 4
  2. The log/error is as follows (a note on the tracebacks follows the log):

    Command Line Args: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
    [02/06 07:12:25 detectron2]: Rank of current process: 0. World size: 4
    ''''''''''''''''''''
    [02/06 07:12:26 detectron2]: Command line arguments: Namespace(config_file='./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=[], resume=False)
    [02/06 07:12:26 detectron2]: Contents of args.config_file=./projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml:
    _BASE_: Base-PointRend-RCNN-FPN.yaml
    MODEL:
      WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
      MASK_ON: true
      RESNETS:
        DEPTH: 50
    ................
    [02/06 07:12:46 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.61 seconds.
    [02/06 07:12:47 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
    [02/06 07:12:55 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
    [02/06 07:12:59 d2.data.build]: Distribution of instances among all 80 categories:
    category #instances   category #instances   category #instances
    person 257253 bicycle 7056 car 43533
    motorcycle 8654 airplane 5129 bus 6061
    train 4570 truck 9970 boat 10576
    traffic light 12842 fire hydrant 1865 stop sign 1983
    parking meter 1283 bench 9820 bird 10542
    cat 4766 dog 5500 horse 6567
    sheep 9223 cow 8014 elephant 5484
    bear 1294 zebra 5269 giraffe 5128
    backpack 8714 umbrella 11265 handbag 12342
    tie 6448 suitcase 6112 frisbee 2681
    skis 6623 snowboard 2681 sports ball 6299
    kite 8802 baseball bat 3273 baseball gl.. 3747
    skateboard 5536 surfboard 6095 tennis racket 4807
    bottle 24070 wine glass 7839 cup 20574
    fork 5474 knife 7760 spoon 6159
    bowl 14323 banana 9195 apple 5776
    sandwich 4356 orange 6302 broccoli 7261
    carrot 7758 hot dog 2884 pizza 5807
    donut 7005 cake 6296 chair 38073
    couch 5779 potted plant 8631 bed 4192
    dining table 15695 toilet 4149 tv 5803
    laptop 4960 mouse 2261 remote 5700
    keyboard 2854 cell phone 6422 microwave 1672
    oven 3334 toaster 225 sink 5609
    refrigerator 2634 book 24077 clock 6320
    vase 6577 scissors 1464 teddy bear 4729
    hair drier 198 toothbrush 1945
    total 849949

    [02/06 07:12:59 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
    [02/06 07:12:59 d2.data.build]: Using training sampler TrainingSampler
    [02/06 07:13:01 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
    [02/06 07:13:01 fvcore.common.file_io]: URL https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl cached in /root/.torch/fvcore_cache/detectron2/ImageNetPretrained/MSRA/R-50.pkl
    ''''''''''''''''
    [02/06 07:13:01 d2.checkpoint.c2_model_loading]: Some model parameters are not in the checkpoint:
      backbone.fpn_lateral2.{bias, weight}
      backbone.fpn_lateral3.{bias, weight}
      backbone.fpn_lateral4.{bias, weight}
      backbone.fpn_lateral5.{bias, weight}
      backbone.fpn_output2.{bias, weight}
      backbone.fpn_output3.{bias, weight}
      backbone.fpn_output4.{bias, weight}
      backbone.fpn_output5.{bias, weight}
      proposal_generator.anchor_generator.cell_anchors.{0, 1, 2, 3, 4}
      proposal_generator.rpn_head.anchor_deltas.{bias, weight}
      proposal_generator.rpn_head.conv.{bias, weight}
      proposal_generator.rpn_head.objectness_logits.{bias, weight}
      roi_heads.box_head.fc1.{bias, weight}
      roi_heads.box_head.fc2.{bias, weight}
      roi_heads.box_predictor.bbox_pred.{bias, weight}
      roi_heads.box_predictor.cls_score.{bias, weight}
      roi_heads.mask_coarse_head.coarse_mask_fc1.{bias, weight}
      roi_heads.mask_coarse_head.coarse_mask_fc2.{bias, weight}
      roi_heads.mask_coarse_head.prediction.{bias, weight}
      roi_heads.mask_coarse_head.reduce_spatial_dim_conv.{bias, weight}
      roi_heads.mask_point_head.fc1.{bias, weight}
      roi_heads.mask_point_head.fc2.{bias, weight}
      roi_heads.mask_point_head.fc3.{bias, weight}
      roi_heads.mask_point_head.predictor.{bias, weight}
    [02/06 07:13:01 d2.checkpoint.c2_model_loading]: The checkpoint contains parameters not used by the model:
      fc1000_b
      fc1000_w
      conv1_b
    [02/06 07:13:02 d2.engine.train_loop]: Starting training from iteration 0
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
        fd, size = storage._share_fd_()
    RuntimeError: unable to write to file
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
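Note on the tracebacks: the identical errors come from the feeder threads of separate DataLoader worker processes. Each one fails while pickling a batch's tensor storage into a shared-memory file descriptor (reduce_storage in torch/multiprocessing/reductions.py). For orientation, a minimal sketch of how to inspect the inter-process sharing strategy PyTorch is using (these are real torch.multiprocessing functions):

    # Orientation sketch: on Linux, PyTorch defaults to the 'file_descriptor'
    # sharing strategy, which backs each shared tensor with a /dev/shm file.
    import torch.multiprocessing as mp

    print(mp.get_sharing_strategy())        # 'file_descriptor' by default on Linux
    print(mp.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'}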

  3. The GPU memory seems to be sufficient (but see the note below the table):

    Thu Feb  6 07:41:13 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:5A:00.0 Off |                    0 |
    | N/A   62C    P0   101W / 250W |  10823MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
    | N/A   55C    P0    96W / 250W |  10298MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-PCIE...  On   | 00000000:62:00.0 Off |                    0 |
    | N/A   58C    P0   103W / 250W |  10298MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-PCIE...  On   | 00000000:66:00.0 Off |                    0 |
    | N/A   59C    P0   103W / 250W |  10298MiB / 32510MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-PCIE...  On   | 00000000:B5:00.0 Off |                    0 |
    | N/A   56C    P0   102W / 250W |  10290MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-PCIE...  On   | 00000000:B9:00.0 Off |                    0 |
    | N/A   62C    P0   110W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-PCIE...  On   | 00000000:BD:00.0 Off |                    0 |
    | N/A   58C    P0    62W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-PCIE...  On   | 00000000:C1:00.0 Off |                    0 |
    | N/A   57C    P0    57W / 250W |  10296MiB / 32510MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
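Note that healthy nvidia-smi readings do not rule this error out: the bus error concerns POSIX shared memory on the host (/dev/shm), which nvidia-smi does not report. A minimal check, assuming a Linux host:

    # The bus error concerns host shared memory (/dev/shm), not GPU memory,
    # so nvidia-smi can look healthy while /dev/shm is exhausted by workers.
    import shutil

    total, used, free = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")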

Environment:

Run python -m detectron2.utils.collect_env in the environment where you observed the issue, and paste the output.

[02/06 07:12:26 detectron2]: Environment info:


sys.platform              linux
Python                    3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
numpy                     1.17.2
detectron2                0.1 @/data/xxx/InstanceSegmentation/detectron2/detectron2-master/detectron2
detectron2 compiler       GCC 5.4
detectron2 CUDA compiler  10.1
detectron2 arch flags     sm_70
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.0 @/opt/conda/lib/python3.6/site-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1,2,3,4,5,6,7       Tesla V100-PCIE-32GB
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.1, V10.1.243
Pillow                    6.2.2
torchvision               0.4.1a0+d94043a @/opt/conda/lib/python3.6/site-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
cv2                       4.2.0

PyTorch built with:

What can I do about this? Thanks.

ppwwyyxx commented 4 years ago

You can find similar issues in https://github.com/pytorch/pytorch/issues/, such as https://github.com/pytorch/pytorch/issues/2926. Please follow the discussions there, as this is not related to detectron2.
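For later readers, a minimal sketch of the workarounds those threads typically converge on, assuming the root cause is an undersized /dev/shm (for example, Docker containers default to 64 MB):

    # Workaround sketch: switch PyTorch to the 'file_system' sharing strategy,
    # which passes tensors between workers through temporary files instead of
    # /dev/shm file descriptors. Call this before any DataLoader is created,
    # e.g. near the top of train_net.py's main().
    import torch.multiprocessing

    torch.multiprocessing.set_sharing_strategy("file_system")

Alternatively, enlarge shared memory when training inside a container (e.g. docker run --shm-size=8g ...), or reduce the number of loader processes via the DATALOADER.NUM_WORKERS config option.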