volgachen opened this issue 5 years ago
**Update:** I have found the solution. If I do not use all of the GPUs on a machine, training gets stuck. I have to set `CUDA_VISIBLE_DEVICES` so that the number of visible GPUs equals `--nproc_per_node`, leaving no free GPU visible. Could anyone give me a more fundamental explanation of why this happens?
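For example, this is a sketch of my fix for the master-node command (GPU index 0 is arbitrary; the point is to expose exactly as many GPUs as `--nproc_per_node` requests):

```bash
# Expose exactly one GPU to match --nproc_per_node=1, so no idle GPU stays visible.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 \
    --master_addr="172.17.62.8" --master_port 17334 \
    tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR models/tmp
```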
❓ Questions and Help
I am trying to train Mask R-CNN on two distributed nodes, one master and one slave. However, the program always gets stuck while building the data-parallel model. After a few minutes the slave node fails with the error messages above.
The master node also gets stuck, but it never prints any messages; I found that it stops while loading the .pkl model file. I wonder what causes this problem. Here are my commands to launch the two nodes.
```bash
# master node (rank 0)
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 \
    --master_addr="172.17.62.8" --master_port 17334 \
    tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR models/tmp

# slave node (rank 1)
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 \
    --master_addr="172.17.62.8" --master_port 17334 \
    tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 8 OUTPUT_DIR models/tmp
```
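To check whether the two nodes can rendezvous at all, independent of maskrcnn-benchmark, a minimal script like the following can be launched with the same `torch.distributed.launch` commands (the file name `check_dist.py` is hypothetical; it relies only on the `--local_rank` argument and the `MASTER_ADDR`/`MASTER_PORT`/`RANK`/`WORLD_SIZE` environment variables that the launcher sets):

```python
# check_dist.py -- minimal rendezvous test, launched the same way as train_net.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# init_process_group blocks until all nnodes * nproc_per_node processes join,
# so a hang here points at the NCCL/rendezvous setup rather than model loading.
dist.init_process_group(backend="nccl", init_method="env://")

tensor = torch.ones(1).cuda()
dist.all_reduce(tensor)  # sums across ranks; every rank should print world_size
print("rank {}/{}: all_reduce -> {}".format(dist.get_rank(), dist.get_world_size(), tensor.item()))
```

If this also hangs, the problem is in the distributed setup itself rather than in the model code.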
Thanks in advance.