NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Could not enable P2P between devices #1516

Open ZhiyiHu1999 opened 15 hours ago

ZhiyiHu1999 commented 15 hours ago

Hello! I am doing all-to-all communication using ncclSend() and ncclRecv() between 4 GPUs on two nodes, with 2 GPUs per node. However, it seems that GPUs on the same node cannot do P2P communication with each other; the debug info is below. Could you help me understand why this is the case? Thanks a lot!

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c2927e9a-08b2-a1cd-17a1-092a418bac6e)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-9d7f82f3-c503-8ee9-26f2-424ddefe47e2)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-96b59d15-054d-c99f-68bf-406414f8d051)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-cdb8590d-7b7f-deec-f4f5-8c132c2bbfed)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-4bab2c5f-3d91-e162-3057-f8068c36d803)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-f5a5d6fc-28e6-6fda-a6fe-0c13357c69bb)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c2927e9a-08b2-a1cd-17a1-092a418bac6e)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-9d7f82f3-c503-8ee9-26f2-424ddefe47e2)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-96b59d15-054d-c99f-68bf-406414f8d051)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-cdb8590d-7b7f-deec-f4f5-8c132c2bbfed)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-4bab2c5f-3d91-e162-3057-f8068c36d803)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-f5a5d6fc-28e6-6fda-a6fe-0c13357c69bb)
The local rank is: 0
ault43:1696338:1696338 [0] NCCL INFO Bootstrap : Using ib0:192.168.0.43<0>
ault43:1696338:1696338 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault43:1696338:1696338 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault43:1696338:1696338 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.17.1+cuda11.8
The local rank is: 0
ault44:1424218:1424218 [0] NCCL INFO cudaDriverVersion 12010
ault44:1424218:1424218 [0] NCCL INFO Bootstrap : Using ib0:192.168.0.44<0>
ault44:1424218:1424218 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault44:1424218:1424218 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault44:1424218:1424218 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.44<0>
ault44:1424218:1424218 [0] NCCL INFO Using network IB
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.43<0>
ault43:1696338:1696338 [0] NCCL INFO Using network IB
ault43:1696338:1696338 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to ./results/Topology_Intra_Node.txt
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO NCCL_GRAPH_DUMP_FILE set by environment to ./results/Graph.txt
ault43:1696338:1696338 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault43:1696338:1696338 [0] NCCL INFO Channel 00/02 :    0   1   2   3
ault43:1696338:1696338 [0] NCCL INFO Channel 01/02 :    0   1   2   3
ault43:1696338:1696338 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
ault43:1696338:1696338 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
ault43:1696338:1696338 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
ault43:1696338:1696338 [0] NCCL INFO P2P Chunksize set to 131072
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 01 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Connected all rings
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 2[41000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 2[41000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Connected all trees
ault43:1696338:1696338 [0] NCCL INFO NCCL_PROTO set by environment to LL
ault43:1696338:1696338 [0] NCCL INFO NCCL_ALGO set by environment to Tree
ault43:1696338:1696338 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault43:1696338:1696338 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault43:1696338:1696338 [0] NCCL INFO comm 0x666cdb0 rank 0 nranks 4 cudaDev 0 busId 41000 commId 0xecdec238786580e6 - Init COMPLETE
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 3[81000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 3[81000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696401 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696401 [0] NCCL INFO Channel 01 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 2[41000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 2[41000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 0[41000] -> 2[41000] [send] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 0[41000] -> 2[41000] [send] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault44:1424218:1424218 [0] NCCL INFO Ring 00 : 1 -> 2 -> 3
ault44:1424218:1424218 [0] NCCL INFO Ring 01 : 1 -> 2 -> 3
ault44:1424218:1424218 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
ault44:1424218:1424218 [0] NCCL INFO P2P Chunksize set to 131072
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 1[81000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 1[81000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Channel 00 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Channel 01 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424218 [0] NCCL INFO Connected all rings
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 0[41000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 2[41000] -> 0[41000] [send] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 2[41000] -> 0[41000] [send] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Connected all trees
ault44:1424218:1424218 [0] NCCL INFO NCCL_PROTO set by environment to LL
ault44:1424218:1424218 [0] NCCL INFO NCCL_ALGO set by environment to Tree
ault44:1424218:1424218 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault44:1424218:1424218 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault44:1424218:1424218 [0] NCCL INFO comm 0x666d2e0 rank 2 nranks 4 cudaDev 0 busId 41000 commId 0xecdec238786580e6 - Init COMPLETE
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 1[81000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 1[81000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424272 [0] NCCL INFO Channel 00 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424272 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424272 [0] NCCL INFO Channel 01 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 0[41000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 0[41000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 2[41000] -> 0[41000] [send] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 2[41000] -> 0[41000] [send] via NET/IB/0/Shared
The local rank is: 1
ault43:1696335:1696335 [1] NCCL INFO cudaDriverVersion 12010
ault43:1696335:1696335 [1] NCCL INFO Bootstrap : Using ib0:192.168.0.43<0>
ault43:1696335:1696335 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault43:1696335:1696335 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault43:1696335:1696335 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.43<0>
ault43:1696335:1696335 [1] NCCL INFO Using network IB
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696335:1696335 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault43:1696335:1696335 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2
ault43:1696335:1696335 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2
ault43:1696335:1696335 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
ault43:1696335:1696335 [1] NCCL INFO P2P Chunksize set to 131072
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 00/0 : 1[81000] -> 2[41000] [send] via NET/IB/0
ault43:1696335:1696335 [1] NCCL INFO Channel 01/0 : 1[81000] -> 2[41000] [send] via NET/IB/0
ault43:1696335:1696335 [1] NCCL INFO Connected all rings
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 00 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 01 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696335 [1] NCCL INFO Connected all trees
ault43:1696335:1696335 [1] NCCL INFO NCCL_PROTO set by environment to LL
ault43:1696335:1696335 [1] NCCL INFO NCCL_ALGO set by environment to Tree
ault43:1696335:1696335 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault43:1696335:1696335 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault43:1696335:1696335 [1] NCCL INFO comm 0x666c240 rank 1 nranks 4 cudaDev 1 busId 81000 commId 0xecdec238786580e6 - Init COMPLETE
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 1[81000] -> 2[41000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 1[81000] -> 2[41000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 3[81000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 3[81000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 1[81000] -> 3[81000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 1[81000] -> 3[81000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 2[41000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 2[41000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 00 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 01 : 1[81000] -> 0[41000] via SHM/direct/direct

The local rank is: 1
ault44:1424216:1424216 [1] NCCL INFO cudaDriverVersion 12010
ault44:1424216:1424216 [1] NCCL INFO Bootstrap : Using ib0:192.168.0.44<0>
ault44:1424216:1424216 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault44:1424216:1424216 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault44:1424216:1424216 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.44<0>
ault44:1424216:1424216 [1] NCCL INFO Using network IB
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424216:1424216 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault44:1424216:1424216 [1] NCCL INFO Ring 00 : 2 -> 3 -> 0
ault44:1424216:1424216 [1] NCCL INFO Ring 01 : 2 -> 3 -> 0
ault44:1424216:1424216 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
ault44:1424216:1424216 [1] NCCL INFO P2P Chunksize set to 131072
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [send] via NET/IB/0
ault44:1424216:1424216 [1] NCCL INFO Channel 01/0 : 3[81000] -> 0[41000] [send] via NET/IB/0
ault44:1424216:1424216 [1] NCCL INFO Connected all rings
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 00 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 01 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424216 [1] NCCL INFO Connected all trees
ault44:1424216:1424216 [1] NCCL INFO NCCL_PROTO set by environment to LL
ault44:1424216:1424216 [1] NCCL INFO NCCL_ALGO set by environment to Tree
ault44:1424216:1424216 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault44:1424216:1424216 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault44:1424216:1424216 [1] NCCL INFO comm 0x666e930 rank 3 nranks 4 cudaDev 1 busId 81000 commId 0xecdec238786580e6 - Init COMPLETE
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 3[81000] -> 0[41000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 3[81000] -> 0[41000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 1[81000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 1[81000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 3[81000] -> 1[81000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 3[81000] -> 1[81000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 0[41000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 0[41000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 00 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 01 : 3[81000] -> 2[41000] via SHM/direct/direct

sjeaugey commented 14 hours ago

I believe GeForce cards are not P2P-capable. That may not be a big deal, though: if you only have 2 GPUs per node and they're connected directly to the CPU rather than through a PCI switch, going through host memory (the SHM transport NCCL reports in the log above) can give better performance than P2P.
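
To see at a glance which transport NCCL picked for each pair of ranks, the `NCCL INFO Channel ... via ...` lines in a log like the one above can be tallied. The following is only a minimal sketch: the helper name `summarize_transports` and the regular expression are hypothetical, written against the line format that appears in this log (NCCL 2.17.1), and other NCCL versions may print these lines differently.

```python
import re
from collections import Counter

# Matches NCCL INFO channel-setup lines such as
#   "... NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct"
#   "... NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0"
# The pattern is an assumption based on the log in this thread.
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \S+ : (\d+)\[\w+\] -> (\d+)\[\w+\](?: \[\w+\])? via (\S+)"
)


def summarize_transports(log_text):
    """Count channel setups per (src_rank, dst_rank, transport family)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match is None:
            continue  # not a channel-setup line (e.g. "Could not enable P2P")
        src, dst = int(match.group(1)), int(match.group(2))
        # Keep only the transport family, e.g. "SHM" or "NET".
        family = match.group(3).split("/")[0]
        counts[(src, dst, family)] += 1
    return counts


# A few lines copied from the log above, used as sample input.
sample = """\
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Channel 01 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0
"""

summary = summarize_transports(sample)
for (src, dst, family), n in sorted(summary.items()):
    print(f"rank {src} -> rank {dst}: {family} x{n}")
```

In the sample, the same-node pair (rank 0 to rank 1) shows up under SHM and the cross-node pairs under NET, which matches the fallback described above: without P2P, intra-node traffic goes through shared host memory rather than failing.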