NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL not working when each rank only sees its own GPU #1066

Open casparvl opened 10 months ago

casparvl commented 10 months ago

Issue

I'm working on a SLURM system, where I noticed the following issue with a small synthetic benchmark running PyTorch DDP with NCCL as backend. I ran it in two ways, which I'll refer to as Case 1 and Case 2. Case 1:

$ srun -n 2 -c 18 --gpus-per-task 1 python3 pytorch_synthetic_benchmark.py --use-ddp --num-iter 2
Iter #0: 687.1 img/sec per GPU
Iter #1: 669.4 img/sec per GPU

But if I specify --gpus instead of --gpus-per-task (Case 2):

$ srun -n 2 -c 18 --gpus 2 python3 pytorch_synthetic_benchmark.py --use-ddp --num-iter 2
Iter #0: 791.6 img/sec per GPU
Iter #1: 767.6 img/sec per GPU

As you can see, performance in the second case is much better. Using dcgmi dmon -e 1011,1012, I noticed that the first run was not using NVLink, whereas the second run was.

Running Case 1 with NCCL_DEBUG=INFO:

gcn6:1080060:1080060 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080060:1080060 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080060:1080060 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080060:1080060 [0] NCCL INFO Using network IB
NCCL version 2.12.12+cuda11.7
gcn6:1080061:1080061 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080061:1080061 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080061:1080061 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080061:1080061 [0] NCCL INFO Using network IB

gcn6:1080061:1080359 [0] misc/nvmlwrap.cc:181 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

gcn6:1080060:1080354 [0] misc/nvmlwrap.cc:181 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
gcn6:1080060:1080354 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff
gcn6:1080061:1080359 [0] NCCL INFO Setting affinity for GPU 0 to 0f,fffc0000
gcn6:1080060:1080354 [0] NCCL INFO Channel 00/02 :    0   1
gcn6:1080060:1080354 [0] NCCL INFO Channel 01/02 :    0   1
gcn6:1080061:1080359 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gcn6:1080060:1080354 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
gcn6:1080061:1080359 [0] NCCL INFO Channel 00 : 1[32000] -> 0[31000] via direct shared memory
gcn6:1080060:1080354 [0] NCCL INFO Channel 00 : 0[31000] -> 1[32000] via direct shared memory
gcn6:1080061:1080359 [0] NCCL INFO Channel 01 : 1[32000] -> 0[31000] via direct shared memory
gcn6:1080060:1080354 [0] NCCL INFO Channel 01 : 0[31000] -> 1[32000] via direct shared memory
gcn6:1080061:1080359 [0] NCCL INFO Connected all rings
gcn6:1080061:1080359 [0] NCCL INFO Connected all trees
gcn6:1080061:1080359 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080061:1080359 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gcn6:1080060:1080354 [0] NCCL INFO Connected all rings
gcn6:1080060:1080354 [0] NCCL INFO Connected all trees
gcn6:1080060:1080354 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080060:1080354 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gcn6:1080061:1080359 [0] NCCL INFO comm 0x7f356c0090d0 rank 1 nranks 2 cudaDev 0 busId 32000 - Init COMPLETE
gcn6:1080060:1080354 [0] NCCL INFO comm 0x7f6e800090d0 rank 0 nranks 2 cudaDev 0 busId 31000 - Init COMPLETE
gcn6:1080060:1080060 [0] NCCL INFO Launch mode Parallel

While running Case 2:

host: gcn6.local.snellius.surf.nl, rank: 0, local_rank: 0
gcn6:1080641:1080641 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080641:1080641 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080641:1080641 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080641:1080641 [0] NCCL INFO Using network IB
NCCL version 2.12.12+cuda11.7
gcn6:1080642:1080642 [1] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080642:1080642 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080642:1080642 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080642:1080642 [1] NCCL INFO Using network IB
gcn6:1080641:1080851 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff
gcn6:1080642:1080858 [1] NCCL INFO Setting affinity for GPU 1 to 0f,fffc0000
gcn6:1080641:1080851 [0] NCCL INFO Channel 00/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 01/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 02/08 :    0   1
gcn6:1080642:1080858 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0
gcn6:1080641:1080851 [0] NCCL INFO Channel 03/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 04/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 05/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 06/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 07/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1
gcn6:1080642:1080858 [1] NCCL INFO Channel 00 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 00 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 01 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 01 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 02 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 02 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 03 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 03 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 04 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 04 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 05 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 05 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 06 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 06 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 07 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 07 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Connected all rings
gcn6:1080642:1080858 [1] NCCL INFO Connected all trees
gcn6:1080642:1080858 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080642:1080858 [1] NCCL INFO 8 coll channels, 8 p2p channels, 8 p2p channels per peer
gcn6:1080641:1080851 [0] NCCL INFO Connected all rings
gcn6:1080641:1080851 [0] NCCL INFO Connected all trees
gcn6:1080641:1080851 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080641:1080851 [0] NCCL INFO 8 coll channels, 8 p2p channels, 8 p2p channels per peer
gcn6:1080642:1080858 [1] NCCL INFO comm 0x7f61c40090d0 rank 1 nranks 2 cudaDev 1 busId 32000 - Init COMPLETE
gcn6:1080641:1080851 [0] NCCL INFO comm 0x7fcbf80090d0 rank 0 nranks 2 cudaDev 0 busId 31000 - Init COMPLETE
gcn6:1080641:1080641 [0] NCCL INFO Launch mode Parallel

The big difference between --gpus-per-task 1 and --gpus 2 is that in the first case, SLURM limits each rank's access to a single GPU:

$ srun -n 2 -c 18 --gpus-per-task 1 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)

Whereas in Case 2, each rank has access to both GPUs:

$ srun -n 2 -c 18 --gpus 2 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)
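Incidentally, this also matches the nvmlDeviceGetHandleByPciBusId() "Not Found" warnings in the Case 1 log: NVML itself can only enumerate the GPU inside the task's cgroup. Here is a rough Python sketch for checking that from within a task (assuming the nvidia-ml-py/pynvml bindings are installed; this is just my illustration, not what NCCL runs):

import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"NVML enumerates {count} GPU(s) from this process")
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    pci = pynvml.nvmlDeviceGetPciInfo(handle)
    # Under --gpus-per-task 1, only the GPU in this task's cgroup shows up here,
    # so a lookup of the peer GPU's bus ID (busId 32000 in the log above, i.e.
    # PCI 32:00.0) fails with "Not Found", just like NCCL's nvmlwrap.cc warning.
    print(f"GPU {i}: busId {pci.busId}")
pynvml.nvmlShutdown()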

Potentially related issues:
https://github.com/NVIDIA/nccl/issues/1017
https://github.com/NVIDIA/nccl/issues/324
https://github.com/NVIDIA/pyxis/issues/73

Actual question

I have a pretty good grasp of what is happening here: I guess the NCCL init fails to discover that both GPUs are physically connected, since each process is limited to its own GPU by a cgroup (set by SLURM). My actual question is: is this a bug/limitation in how NCCL is initialized? I.e. if there were a way to discover across cgroups that the other GPU is in the same node, would it help? Or would a successful init not even help, because a process (e.g. rank 0, running its compute on GPU 0) needs access to 'the other' process's GPU (i.e. GPU 1) in order to even do IPC, and that access is simply not possible due to the cgroups?

This comment and this comment seem to suggest the latter, but since that ticket is about GPUs being isolated in different containers (whereas in this case they are 'only' in different cgroups), I wasn't sure. I don't know the technical details of IPC, but I would half expect this to be handled at a different level (kernel? driver, i.e. root?) than the user process, in which case it would/should be possible to communicate across cgroups.
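To make the visibility point concrete, here is a rough PyTorch-level sketch (just an illustration of device visibility and peer access, not how NCCL implements IPC internally):

import torch

# With --gpus 2, every rank sees two devices and can ask about peer access:
if torch.cuda.device_count() >= 2:
    print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
    print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))
else:
    # With --gpus-per-task 1 the question cannot even be asked: from this
    # process's point of view, device 1 does not exist at all.
    print("Only", torch.cuda.device_count(), "device visible; no peer to query")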

It would be a shame if there were no resolution to this, for two reasons:

  1. The silent fallback to communication over PCIe could mean that a lot of users on SLURM systems are (unknowingly) leaving a lot of performance on the table.

  2. While I could recommend that users on our cluster use --gpus or --gpus-per-node (which also doesn't put GPUs in a cgroup per task) instead of --gpus-per-task, the advantage of --gpus-per-task is that it makes the user code simpler: the code doesn't have to handle device placement explicitly for each rank, since each process only sees a single GPU (see the sketch below for the kind of boilerplate I mean).
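For reference, the per-rank placement boilerplate needed when all GPUs are visible looks roughly like this (a minimal sketch: SLURM_LOCALID is the node-local task ID set by srun, torchrun would set LOCAL_RANK instead, and the rendezvous variables for init_process_group are assumed to be provided by the job script):

import os
import torch
import torch.distributed as dist

# Node-local rank: srun sets SLURM_LOCALID, torchrun sets LOCAL_RANK
local_rank = int(os.environ.get("SLURM_LOCALID", os.environ.get("LOCAL_RANK", "0")))

# The explicit device placement that --gpus-per-task would make unnecessary
torch.cuda.set_device(local_rank)

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the job script
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}")
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])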

sjeaugey commented 10 months ago

Indeed, NCCL needs to see all GPUs on the same node for NVLink detection to work properly.