NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Multi-Node Distributed Training #260

Open Lausannen opened 5 years ago

Lausannen commented 5 years ago

I want to know if apex supports multi-node multi-GPU distributed training. I followed the PyTorch documentation and call `torch.distributed.init_process_group()`. In my case I have two nodes, each with 4 GPUs, and launch with the following commands:

On node 0: `python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_port=2345 --nnodes=2 --node_rank=0 --master_addr="192.168.0.1" tools/train_net.py`

On node 1: `python -m torch.distributed.launch --nproc_per_node=$NGPUS --master_port=2345 --nnodes=2 --node_rank=1 --master_addr="192.168.0.1" tools/train_net.py`

The initialization call is `torch.distributed.init_process_group(backend="nccl", init_method="env://")`.

When I run the code, node 0 prints a little training info, but node 1 fails with an error like:

`python3.7/site-packages/apex/parallel/distributed.py", line 239, in __init__ call(coalesced, extra_args)`
`RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch-nightly_1553663942394/work/torch/lib/c10d/ProcessGroupNCCL.cpp:260, unhandled system error`
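
For context, a minimal sketch (not the actual `tools/train_net.py`; the model and optimizer are stand-ins) of the per-process setup those launch commands expect, using apex's documented `amp` and `DistributedDataParallel` pattern:

```python
# Sketch of the per-process setup for torch.distributed.launch + apex.
# Model/optimizer below are placeholders; the apex calls are the standard pattern.
import argparse
import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # passed in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                    # pin one GPU per process
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(128, 128).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model)                                        # apex's DDP wrapper
```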

mcarilli commented 5 years ago

Yes, apex should work with PyTorch's native multinode training (as described here, which I assume was the documentation you were following). Is $NGPUS here an environment variable set to 4?

Also, master_addr should be the IP address of node 0 on the network, so that is worth double-checking. Your launch command looks right otherwise...
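
If it helps to debug, a quick sanity check (my own snippet, not part of apex) that could go at the top of `tools/train_net.py` on each node to confirm what the launcher exported for the `env://` rendezvous:

```python
# Print the rendezvous variables set by torch.distributed.launch on this node.
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(f"{key} = {os.environ.get(key)}")
```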

MrRace commented 4 years ago

> Yes, apex should work with PyTorch's native multinode training (as described here, which I assume was the documentation you were following). Is $NGPUS here an environment variable set to 4?
>
> Also, master_addr should be the IP address of node 0 on the network, so that is worth double-checking. Your launch command looks right otherwise...

But can one node use 4 GPUs? With `python -m torch.distributed.launch --nproc_per_node=4 run_test.py` I see only the 0th GPU being used: there are 4 processes on GPU 0, and the other GPUs are unoccupied. Why?
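
A likely cause, though `run_test.py` isn't shown here: if the script never pins each process to its own device, every rank allocates on the default device (GPU 0). `torch.distributed.launch` passes `--local_rank` to each spawned process, and the script must use it before any CUDA or NCCL work. A minimal sketch:

```python
# Pin each launched process to its own GPU; without set_device,
# all ranks default to cuda:0 and pile onto the same device.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # provided by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")
```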