facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

[Multi node Training] Training time is much longer than on a single node #4730

Open daebakk opened 1 year ago

daebakk commented 1 year ago

Hello

There is a problem: training is very slow when I train a model with detectron2 across two machines.

I use RTX A6000 GPUs, 4 per node, and train my models on two nodes. Both nodes run Ubuntu 20.04. Training runs normally and the log.txt file is generated correctly.

I set the environment variables as follows

Node1 setting (189):
export NCCL_DEBUG="INFO"
export NCCL_SOCKET_IFNAME="enp36s0f1"
export GLOO_SOCKET_IFNAME="enp36s0f1"

Node2 setting:
export NCCL_DEBUG="INFO"
export NCCL_SOCKET_IFNAME="enp4s0"
export GLOO_SOCKET_IFNAME="enp4s0"
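The same per-node settings can also be applied from Python before any process group is created, which keeps the interface names next to the launch code. This is only a minimal sketch: `socket_env`, `apply_socket_env`, and the `NODE_IFNAMES` mapping are hypothetical helpers for illustration, not part of detectron2 or PyTorch. Only the variable names and interface names come from the settings above.

```python
import os

# Hypothetical mapping from a node label to its NIC name (values from the post).
NODE_IFNAMES = {"node1": "enp36s0f1", "node2": "enp4s0"}


def socket_env(ifname, nccl_debug="INFO"):
    """Return the three variables from the post for one interface name."""
    return {
        "NCCL_DEBUG": nccl_debug,
        "NCCL_SOCKET_IFNAME": ifname,
        "GLOO_SOCKET_IFNAME": ifname,
    }


def apply_socket_env(node, environ=os.environ):
    """Write the variables into `environ`; call this before launching workers."""
    env = socket_env(NODE_IFNAMES[node])
    environ.update(env)
    return env
```

Calling `apply_socket_env("node1")` at the top of the training script on the first machine (and `"node2"` on the second) should have the same effect as the `export` lines, since spawned worker processes inherit the parent's environment.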

At first, when I set only the NCCL environment variables (without the GLOO one), I got this error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/xgy/experiments/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
main_func(*args)
File "/root/xgy/experiments/distributed-pytorch/MaskRCNN/train/train_net.py", line 141, in main
trainer = Trainer(cfg)
File "/root/xgy/experiments/detectron2/detectron2/engine/defaults.py", line 383, in __init__
data_loader = self.build_train_loader(cfg)
File "/root/xgy/experiments/detectron2/detectron2/engine/defaults.py", line 543, in build_train_loader
return build_detection_train_loader(cfg)
File "/root/xgy/experiments/detectron2/detectron2/config/config.py", line 192, in wrapped
explicit_args = _get_args_from_config(from_config, *args, **kwargs)
File "/root/xgy/experiments/detectron2/detectron2/config/config.py", line 229, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/root/xgy/experiments/detectron2/detectron2/data/build.py", line 328, in _train_loader_from_config
sampler = TrainingSampler(len(dataset))
File "/root/xgy/experiments/detectron2/detectron2/data/samplers/distributed_sampler.py", line 37, in __init__
seed = comm.shared_random_seed()
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 230, in shared_random_seed
all_ints = all_gather(ints)
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 154, in all_gather
group = _get_global_gloo_group()
File "/root/xgy/experiments/detectron2/detectron2/utils/comm.py", line 89, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2019, in new_group
pg = _new_process_group_helper(group_world_size,
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 504, in _new_process_group_helper
pg = ProcessGroupGloo(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [127.0.0.1]:7602: Connection refused
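The last frame shows Gloo trying to reach a peer at 127.0.0.1, which cannot work across machines. One possible explanation, which nothing in this thread confirms, is that without GLOO_SOCKET_IFNAME each process advertises the address its own hostname resolves to, and that address is a loopback one. The sketch below only checks that assumption on a node. The helper names `is_loopback_address` and `hostname_resolution` are made up for illustration.

```python
import ipaddress
import socket


def is_loopback_address(addr):
    """True if `addr` lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(addr).is_loopback


def hostname_resolution():
    """Return (hostname, resolved IPv4 address or None, is_loopback)."""
    name = socket.gethostname()
    try:
        addr = socket.gethostbyname(name)
    except OSError:  # resolution failed (socket.gaierror is an OSError)
        return name, None, False
    return name, addr, is_loopback_address(addr)


if __name__ == "__main__":
    name, addr, loopback = hostname_resolution()
    print(f"{name} -> {addr} (loopback: {loopback})")
```

If this reports a loopback address on either node, pinning the interface with GLOO_SOCKET_IFNAME, as done in the settings above, is one way to avoid relying on hostname resolution.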

After I set export GLOO_SOCKET_IFNAME="enp36s0f1" on Node1 and export GLOO_SOCKET_IFNAME="enp4s0" on Node2, training worked, but it is too slow. This is my NCCL debug output:

cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379562:1379562 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379564:1379564 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379564:1379564 [2] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379565:1379565 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Bootstrap : Using enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379565:1379565 [3] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

cvlab189-System-Product-Name:1379563:1379563 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO NET/Socket : Using [0]enp36s0f1:168.188.129.189<0>
cvlab189-System-Product-Name:1379563:1379563 [1] NCCL INFO Using network Socket
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 3[61000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 3[61000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 7[68000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 00 : 3[61000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Channel 01 : 3[61000] -> 2[41000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all rings
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 00 : 2[41000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 00 : 1[2c000] -> 0[1000] via P2P/IPC
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Channel 01 : 2[41000] -> 1[2c000] via P2P/IPC
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Channel 01 : 1[2c000] -> 0[1000] via P2P/IPC
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 4[19000] -> 0[1000] [receive] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 00 : 0[1000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Channel 01 : 0[1000] -> 4[19000] [send] via NET/Socket/0
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO Connected all trees
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cvlab189-System-Product-Name:1379562:1379735 [0] NCCL INFO comm 0x7f7590002fb0 rank 0 nranks 8 cudaDev 0 busId 1000 - Init COMPLETE
cvlab189-System-Product-Name:1379563:1379738 [1] NCCL INFO comm 0x7f3928002fb0 rank 1 nranks 8 cudaDev 1 busId 2c000 - Init COMPLETE
cvlab189-System-Product-Name:1379565:1379737 [3] NCCL INFO comm 0x7f9f00002fb0 rank 3 nranks 8 cudaDev 3 busId 61000 - Init COMPLETE
cvlab189-System-Product-Name:1379564:1379736 [2] NCCL INFO comm 0x7fe414002fb0 rank 2 nranks 8 cudaDev 2 busId 41000 - Init COMPLETE
cvlab189-System-Product-Name:1379562:1379562 [0] NCCL INFO Launch mode Parallel
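In the log above, GPUs on the same machine are connected "via P2P/IPC", while the links between ranks 3 -> 4 and 7 -> 0 go "via NET/Socket", that is, plain TCP sockets on the Ethernet interface (libibverbs could not be opened, so no InfiniBand transport is used). A small parser makes this split easier to see in longer logs. This is a rough sketch: the regular expression is written against the line format shown here and may need adjusting for other NCCL versions, and `summarize_transports` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... NCCL INFO Channel 00 : 3[61000] -> 4[19000] [send] via NET/Socket/0"
# but not the ring listing "Channel 00/02 :    0   1   2 ...".
CHANNEL_RE = re.compile(
    r"Channel \d+ : (\d+)\[\w+\] -> (\d+)\[\w+\]"
    r"(?: \[(?:send|receive)\])? via (\S+)"
)


def summarize_transports(log_text):
    """Count channel connections per transport, e.g. 'P2P/IPC' or 'NET/Socket'."""
    counts = Counter()
    for match in CHANNEL_RE.finditer(log_text):
        # Drop the trailing device index ("NET/Socket/0" -> "NET/Socket").
        transport = "/".join(match.group(3).split("/")[:2])
        counts[transport] += 1
    return counts
```

Running `summarize_transports(open("nccl.log").read())` on a saved copy of the output (the file name is an assumption) gives a quick view of how many connections cross the network rather than staying inside one machine.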

For the record, the guide at https://pytorch.org/docs/stable/distributed.html says: "If you encounter any problem with NCCL, use Gloo as the fallback option. (Note that Gloo currently runs slower than NCCL for GPUs.)" As the traceback above shows, distributed_sampler in detectron2 uses the Gloo backend.

When I run python -c "import torch;print(torch.cuda.nccl.version())" in the Conda virtual environment to check the NCCL version, I get (2, 10, 3) on both machines. I did not install NCCL separately (I only installed PyTorch). What should I do?
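To judge whether the Ethernet link alone could explain the slowdown, a back-of-the-envelope estimate may help. In a ring all-reduce over N ranks, each rank sends about 2(N-1)/N times the gradient size, and in a two-node ring that traffic has to cross the inter-node link. The sketch below ignores latency, any overlap with computation, and NCCL's choice between ring and tree algorithms. The parameter count and link speed in the example are assumptions, since neither is given in this post.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, world_size, link_gbit_per_s):
    """Rough lower bound on one gradient all-reduce over the inter-node link."""
    payload = num_params * bytes_per_param
    # Each rank sends 2 * (N - 1) / N of the payload in a ring all-reduce.
    bytes_on_link = 2 * (world_size - 1) / world_size * payload
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return bytes_on_link / link_bytes_per_s


# Assumed values: 44M fp32 parameters, 8 ranks, 1 Gbit/s Ethernet.
print(ring_allreduce_seconds(44_000_000, 4, 8, 1.0))
```

Under these assumed numbers the estimate is roughly 2.5 seconds per iteration spent on communication alone, and about a tenth of that on a 10 Gbit/s link. Measuring the actual link speed between the two nodes would show which case applies here.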

github-actions[bot] commented 1 year ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";