got stuck while distributed training

❓ Questions and Help

I am trying to train mask rcnn with two distributed nodes, one master and one slave. However, the program always get stuck when building data-parallel models. After a few minutes the slave node goes wrong with messages above.

Traceback (most recent call last):
  File "tools/train_net.py", line 194, in <module>
    main()
  File "tools/train_net.py", line 187, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 53, in train
    broadcast_buffers=False,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 220, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 386, in _dist_broadcast_coalesced
    dist._dist_broadcast_coalesced(self.process_group, tensors, buffer_size, False)
RuntimeError: Resource temporarily unavailable

The master node also gets stuck, however it never gives any messages. I found that it stops while loading the pkl model file. I wonder what is the cause of this problem. Here are my commands to launch the two nodes.

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="172.17.62.8" --master_port 17334 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR models/tmp

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="172.17.62.8" --master_port 17334 tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR models/tmp

Thanks in advance.

facebookresearch / maskrcnn-benchmark

got stuck while distributed training #931

❓ Questions and Help