[Multi node Training] Training time is very longer than a single node

Hello

There is a problem that the training time is very slow when learning the model with detectron2 using two machines

I use A6000 RTX with 4 GPUs per node and train my models with the two nodes. Two nodes are on Ubuntu 20.04. Training is normally working and the log.txt file is also generated well.

I set the environment variables as follows

Node1 setting(189) export NCCL_DEBUG="INFO" export NCCL_SOCKET_IFNAME="enp36s0f1" export GLOO_SOCKET_IFNAME="enp36s0f1"

Node2 setting export NCCL_DEBUG="INFO" export NCCL_SOCKET_IFNAME="enp4s0" export GLOO_SOCKET_IFNAME="enp4s0"

First, when I only set NCCL environment variables (not set GLOO), I got these errors

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/xgy/experiments/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
main_func(*args)
File "/root/xgy/experiments/distributed-pytorch/MaskRCNN/train/train_net.py", line 141, in main
trainer = Trainer(cfg)
File "/root/xgy/experiments/detectron2/detectron2/engine/defaults.py", line 383, in init
data_loader = self.build_train_loader(cfg)
File "/root/xgy/experiments/detectron2/detectron2/engine/defaults.py", line 543, in build_train_loader
return build_detection_train_loader(cfg)
File "/root/xgy/experiments/detectron2/detectron2/config/config.py", line 192, in wrapped
explicit_args = _get_args_from_config(from_config, *args, **kwargs)
File "/root/xgy/experiments/detectron2/detectron2/config/config.py", line 229, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/root/xgy/experiments/detectron2/detectron2/data/build.py", line 328, in _train_loader_from_config
sampler = TrainingSampler(len(dataset))
File "/root/xgy/experiments/detectron2/detectron2/data/samplers/distributed_sampler.py", line 37, in init
seed = comm.shared_random_seed()
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 230, in shared_random_seed
all_ints = all_gather(ints)
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 154, in all_gather
group = _get_global_gloo_group()
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 89, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2019, in new_group
pg = _new_process_group_helper(group_world_size,
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 504, in _new_process_group_helper
pg = ProcessGroupGloo(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [127.0.0.1]:7602: Connection refused

After I set export GLOO_SOCKET_IFNAME="enp4s0" and export GLOO_SOCKET_IFNAME="enp36s0f1" respectively, The training worked but the time is too slow. This is my NCCL BUG Report

cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379562:1379562 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379564:1379564 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379565:1379565 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379563:1379563 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 3[61000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 3[61000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 0[1000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 0[1000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO comm 0x7f7590002fb0 rank 0 nranks 8 cudaDev 0 busId 1000 - Init COMPLETE
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO comm 0x7f3928002fb0 rank 1 nranks 8 cudaDev 1 busId 2c000 - Init COMPLETE
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO comm 0x7f9f00002fb0 rank 3 nranks 8 cudaDev 3 busId 61000 - Init COMPLETE
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO comm 0x7fe414002fb0 rank 2 nranks 8 cudaDev 2 busId 41000 - Init COMPLETE
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Launch mode Parallel

For the record, according to this guide https://pytorch.org/docs/stable/distributed.html, "If you encounter any problem with NCCL, use Gloo as the fallback option. (Note that Gloo currently runs slower than NCCL for GPUs." distributed_sampler in detectron2 uses gloo backend.

When I type this command python -c "import torch;print(torch.cuda.nccl.version())"(NCCL Version check in Conda virtual Enviroment) (2, 10, 3) for both two machines I additionally didn't install NCCL (only installed Pytorch) What should I do?

facebookresearch / detectron2

[Multi node Training] Training time is very longer than a single node #4730