Open raninbowlalala opened 1 year ago
4c217144f0b1:15232:15245 [0] NCCL INFO Bootstrap : Using eth0:10.0.1.2<0> 4c217144f0b1:15232:15245 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation e0a95b6643f4:59249:59342 [0] NCCL INFO cudaDriverVersion 11070 e0a95b6643f4:59249:59342 [0] NCCL INFO Bootstrap : Using eth0:10.0.1.4<0> e0a95b6643f4:59249:59342 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation e0a95b6643f4:59249:59342 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.0.1.4<0> e0a95b6643f4:59249:59342 [0] NCCL INFO Using network IB e0a95b6643f4:59249:59342 [0] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7f4c0be00200 seconds: cudaHostAlloc=0.0034835 4c217144f0b1:15232:15327 [3] NCCL INFO cudaDriverVersion 11070 e0a95b6643f4:59249:59343 [3] NCCL INFO Using network IB e0a95b6643f4:59249:59343 [3] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7f4bcc000200 seconds: cudaHostAlloc=0.00306503 NCCL version 2.13.4+cuda11.7 4c217144f0b1:15232:15327 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [RO]; OOB eth0:10.0.1.2<0> 4c217144f0b1:15232:15326 [0] NCCL INFO Using network IB 4c217144f0b1:15232:15327 [3] NCCL INFO Using network IB 4c217144f0b1:15232:15326 [0] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7feb1d200200 seconds: cudaHostAlloc=0.0323954 4c217144f0b1:15232:15327 [3] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7feb00400200 seconds: cudaHostAlloc=0.0327002 4c217144f0b1:15232:15325 [1] NCCL INFO Using network IB 4c217144f0b1:15232:15325 [1] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7feae7600400 seconds: cudaHostAlloc=0.0135504 e0a95b6643f4:59249:59341 [1] NCCL INFO Using network IB e0a95b6643f4:59249:59341 [1] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7f4bb2c00200 seconds: cudaHostAlloc=0.00128145 e0a95b6643f4:59249:59340 [2] NCCL INFO Using network IB 
e0a95b6643f4:59249:59340 [2] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7f4b9ba00000 seconds: cudaHostAlloc=0.00011263 4c217144f0b1:15232:15324 [2] NCCL INFO Using network IB 4c217144f0b1:15232:15324 [2] NCCL INFO init.cc:327 Cuda Host Alloc Size 4 pointer 0x7fead5a00000 seconds: cudaHostAlloc=0.000147215 4c217144f0b1:15232:15327 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' 4c217144f0b1:15232:15327 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' 4c217144f0b1:15232:15327 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' 4c217144f0b1:15232:15327 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' 4c217144f0b1:15232:15327 [3] NCCL INFO transport/p2p.cc:143 Cuda Alloc Size 2097152 pointer 0x7feacb000000 seconds: cudaStreamCreateWithFlags=1.7337e-05 cudaMalloc=0.000513325 4c217144f0b1:15232:15327 [3] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === 4c217144f0b1:15232:15327 [3] NCCL INFO CPU/0 (1/1/2) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - GPU/1A000 (0) 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/3D000 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - NIC/1C000 4c217144f0b1:15232:15327 [3] NCCL INFO + NET[12.5] - NET/0 (90b97a0003a1420c/1/12.500000) 4c217144f0b1:15232:15327 [3] NCCL INFO + NET[12.5] - NET/1 (90b97a0003a1420c/2/12.500000) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - GPU/3D000 (1) 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[22.0] - GPU/89000 4c217144f0b1:15232:15327 [3] NCCL INFO + SYS[9.0] - CPU/1 4c217144f0b1:15232:15327 [3] NCCL INFO CPU/1 (1/1/2) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) 
4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - GPU/89000 (2) 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[22.0] - GPU/3D000 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - NIC/8A000 4c217144f0b1:15232:15327 [3] NCCL INFO + NET[12.5] - NET/2 (48bc7a0003a1420c/1/12.500000) 4c217144f0b1:15232:15327 [3] NCCL INFO + NET[12.5] - NET/3 (48bc7a0003a1420c/2/12.500000) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) 4c217144f0b1:15232:15327 [3] NCCL INFO + PCI[12.0] - GPU/B2000 (3) 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/89000 4c217144f0b1:15232:15327 [3] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15327 [3] NCCL INFO + SYS[9.0] - CPU/0 4c217144f0b1:15232:15327 [3] NCCL INFO ========================================== 4c217144f0b1:15232:15327 [3] NCCL INFO GPU/1A000 :GPU/1A000 (0/5000.000000/LOC) GPU/3D000 (1/44.000000/NVL) GPU/89000 (2/44.000000/NVB) GPU/B2000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 4c217144f0b1:15232:15327 [3] NCCL INFO GPU/3D000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (0/5000.000000/LOC) GPU/89000 (1/22.000000/NVL) GPU/B2000 (2/44.000000/NVB) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 4c217144f0b1:15232:15327 [3] NCCL INFO GPU/89000 :GPU/1A000 (2/44.000000/NVB) GPU/3D000 (1/22.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/B2000 (1/44.000000/NVL) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15327 [3] NCCL INFO GPU/B2000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (2/44.000000/NVB) GPU/89000 (1/44.000000/NVL) GPU/B2000 (0/5000.000000/LOC) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) 
NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15327 [3] NCCL INFO NET/0 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/12.500000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' 4c217144f0b1:15232:15327 [3] NCCL INFO NET/1 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (2/12.500000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) 4c217144f0b1:15232:15327 [3] NCCL INFO NET/2 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/12.500000/LOC) 4c217144f0b1:15232:15327 [3] NCCL INFO NET/3 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (2/12.500000/LOC) NET/3 (0/5000.000000/LOC) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' e0a95b6643f4:59249:59340 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' e0a95b6643f4:59249:59340 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' 4c217144f0b1:15232:15324 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' 4c217144f0b1:15232:15324 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' e0a95b6643f4:59249:59340 [2] NCCL INFO transport/p2p.cc:143 Cuda Alloc Size 2097152 pointer 0x7f4b98800000 seconds: cudaStreamCreateWithFlags=1.3753e-05 cudaMalloc=0.000415117 
4c217144f0b1:15232:15324 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' 4c217144f0b1:15232:15324 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' e0a95b6643f4:59249:59340 [2] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === e0a95b6643f4:59249:59340 [2] NCCL INFO CPU/0 (1/1/2) 4c217144f0b1:15232:15324 [2] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === 4c217144f0b1:15232:15324 [2] NCCL INFO CPU/0 (1/1/2) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - GPU/1A000 (0) 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/3D000 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - NIC/1C000 4c217144f0b1:15232:15324 [2] NCCL INFO + NET[12.5] - NET/0 (90b97a0003a1420c/1/12.500000) 4c217144f0b1:15232:15324 [2] NCCL INFO + NET[12.5] - NET/1 (90b97a0003a1420c/2/12.500000) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - GPU/3D000 (1) 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[22.0] - GPU/89000 e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - GPU/1A000 (4) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15324 [2] NCCL INFO + SYS[9.0] - CPU/1 4c217144f0b1:15232:15324 [2] NCCL INFO CPU/1 (1/1/2) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - GPU/89000 (2) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/3D000 e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - NIC/1C000 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[22.0] - GPU/3D000 
4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - NIC/8A000 4c217144f0b1:15232:15324 [2] NCCL INFO + NET[12.5] - NET/2 (48bc7a0003a1420c/1/12.500000) 4c217144f0b1:15232:15324 [2] NCCL INFO + NET[12.5] - NET/3 (48bc7a0003a1420c/2/12.500000) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) 4c217144f0b1:15232:15324 [2] NCCL INFO + PCI[12.0] - GPU/B2000 (3) 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/89000 4c217144f0b1:15232:15324 [2] NCCL INFO + NVL[44.0] - GPU/1A000 e0a95b6643f4:59249:59340 [2] NCCL INFO + NET[12.5] - NET/0 (b8b77a0003a1420c/1/12.500000) e0a95b6643f4:59249:59340 [2] NCCL INFO + NET[12.5] - NET/1 (b8b77a0003a1420c/2/12.500000) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - GPU/3D000 (5) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15324 [2] NCCL INFO + SYS[9.0] - CPU/0 4c217144f0b1:15232:15324 [2] NCCL INFO ========================================== 4c217144f0b1:15232:15324 [2] NCCL INFO GPU/1A000 :GPU/1A000 (0/5000.000000/LOC) GPU/3D000 (1/44.000000/NVL) GPU/89000 (2/44.000000/NVB) GPU/B2000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[22.0] - GPU/89000 4c217144f0b1:15232:15324 [2] NCCL INFO GPU/3D000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (0/5000.000000/LOC) GPU/89000 (1/22.000000/NVL) GPU/B2000 (2/44.000000/NVB) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 4c217144f0b1:15232:15324 [2] NCCL INFO GPU/89000 :GPU/1A000 (2/44.000000/NVB) GPU/3D000 (1/22.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/B2000 (1/44.000000/NVL) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 
(5/12.000000/PHB) NET/3 (5/12.000000/PHB) e0a95b6643f4:59249:59340 [2] NCCL INFO + SYS[9.0] - CPU/1 e0a95b6643f4:59249:59340 [2] NCCL INFO CPU/1 (1/1/2) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - GPU/89000 (6) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/B2000 e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[22.0] - GPU/3D000 e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - NIC/8A000 4c217144f0b1:15232:15324 [2] NCCL INFO GPU/B2000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (2/44.000000/NVB) GPU/89000 (1/44.000000/NVL) GPU/B2000 (0/5000.000000/LOC) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15324 [2] NCCL INFO NET/0 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/12.500000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO + NET[12.5] - NET/2 (88b97a0003a1420c/1/12.500000) e0a95b6643f4:59249:59340 [2] NCCL INFO + NET[12.5] - NET/3 (88b97a0003a1420c/2/12.500000) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) e0a95b6643f4:59249:59340 [2] NCCL INFO + PCI[12.0] - GPU/B2000 (7) e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/89000 e0a95b6643f4:59249:59340 [2] NCCL INFO + NVL[44.0] - GPU/1A000 e0a95b6643f4:59249:59340 [2] NCCL INFO + SYS[9.0] - CPU/0 e0a95b6643f4:59249:59340 [2] NCCL INFO ========================================== e0a95b6643f4:59249:59340 [2] NCCL INFO GPU/1A000 :GPU/1A000 (0/5000.000000/LOC) GPU/3D000 (1/44.000000/NVL) GPU/89000 (2/44.000000/NVB) GPU/B2000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 
4c217144f0b1:15232:15324 [2] NCCL INFO NET/1 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (2/12.500000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) 4c217144f0b1:15232:15324 [2] NCCL INFO NET/2 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/12.500000/LOC) e0a95b6643f4:59249:59340 [2] NCCL INFO GPU/3D000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (0/5000.000000/LOC) GPU/89000 (1/22.000000/NVL) GPU/B2000 (2/44.000000/NVB) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO GPU/89000 :GPU/1A000 (2/44.000000/NVB) GPU/3D000 (1/22.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/B2000 (1/44.000000/NVL) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) e0a95b6643f4:59249:59340 [2] NCCL INFO GPU/B2000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (2/44.000000/NVB) GPU/89000 (1/44.000000/NVL) GPU/B2000 (0/5000.000000/LOC) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15324 [2] NCCL INFO NET/3 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (2/12.500000/LOC) NET/3 (0/5000.000000/LOC) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/0 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 
(4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/12.500000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/1 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (2/12.500000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/2 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/12.500000/LOC) e0a95b6643f4:59249:59340 [2] NCCL INFO NET/3 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (2/12.500000/LOC) NET/3 (0/5000.000000/LOC) e0a95b6643f4:59249:59343 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' e0a95b6643f4:59249:59343 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' e0a95b6643f4:59249:59343 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' e0a95b6643f4:59249:59343 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' e0a95b6643f4:59249:59343 [3] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === e0a95b6643f4:59249:59343 [3] NCCL INFO CPU/0 (1/1/2) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - GPU/1A000 (4) e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/B2000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/3D000 e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - NIC/1C000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NET[12.5] - NET/0 (b8b77a0003a1420c/1/12.500000) e0a95b6643f4:59249:59343 [3] NCCL INFO + NET[12.5] - 
NET/1 (b8b77a0003a1420c/2/12.500000) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - GPU/3D000 (5) e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/1A000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[22.0] - GPU/89000 e0a95b6643f4:59249:59343 [3] NCCL INFO + SYS[9.0] - CPU/1 e0a95b6643f4:59249:59343 [3] NCCL INFO CPU/1 (1/1/2) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - GPU/89000 (6) e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/B2000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[22.0] - GPU/3D000 e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - NIC/8A000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NET[12.5] - NET/2 (88b97a0003a1420c/1/12.500000) e0a95b6643f4:59249:59343 [3] NCCL INFO + NET[12.5] - NET/3 (88b97a0003a1420c/2/12.500000) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) e0a95b6643f4:59249:59343 [3] NCCL INFO + PCI[12.0] - GPU/B2000 (7) e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/89000 e0a95b6643f4:59249:59343 [3] NCCL INFO + NVL[44.0] - GPU/1A000 e0a95b6643f4:59249:59343 [3] NCCL INFO + SYS[9.0] - CPU/0 e0a95b6643f4:59249:59343 [3] NCCL INFO ========================================== e0a95b6643f4:59249:59343 [3] NCCL INFO GPU/1A000 :GPU/1A000 (0/5000.000000/LOC) GPU/3D000 (1/44.000000/NVL) GPU/89000 (2/44.000000/NVB) GPU/B2000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) e0a95b6643f4:59249:59343 [3] NCCL INFO GPU/3D000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (0/5000.000000/LOC) GPU/89000 (1/22.000000/NVL) GPU/B2000 (2/44.000000/NVB) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) e0a95b6643f4:59249:59343 [3] NCCL 
INFO GPU/89000 :GPU/1A000 (2/44.000000/NVB) GPU/3D000 (1/22.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/B2000 (1/44.000000/NVL) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) e0a95b6643f4:59249:59343 [3] NCCL INFO GPU/B2000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (2/44.000000/NVB) GPU/89000 (1/44.000000/NVL) GPU/B2000 (0/5000.000000/LOC) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) e0a95b6643f4:59249:59343 [3] NCCL INFO NET/0 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/12.500000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59343 [3] NCCL INFO NET/1 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (2/12.500000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) e0a95b6643f4:59249:59343 [3] NCCL INFO NET/2 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/12.500000/LOC) e0a95b6643f4:59249:59343 [3] NCCL INFO NET/3 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (2/12.500000/LOC) NET/3 (0/5000.000000/LOC) e0a95b6643f4:59249:59340 [2] NCCL INFO Pattern 4, crossNic 1, nChannels 1, speed 10.000000/10.000000, type NVL/PHB, sameChannels 1 e0a95b6643f4:59249:59340 [2] NCCL INFO 0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/3 
4c217144f0b1:15232:15325 [1] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' 4c217144f0b1:15232:15325 [1] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' e0a95b6643f4:59249:59340 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 22.000000/10.000000, type NVL/SYS, sameChannels 1 e0a95b6643f4:59249:59340 [2] NCCL INFO 0 : NET/0 GPU/5 GPU/6 GPU/7 GPU/4 NET/0 4c217144f0b1:15232:15325 [1] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' 4c217144f0b1:15232:15325 [1] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' 4c217144f0b1:15232:15326 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' e0a95b6643f4:59249:59340 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1 4c217144f0b1:15232:15326 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' 4c217144f0b1:15232:15325 [1] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === 4c217144f0b1:15232:15325 [1] NCCL INFO CPU/0 (1/1/2) 4c217144f0b1:15232:15326 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - GPU/1A000 (0) 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/3D000 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - NIC/1C000 4c217144f0b1:15232:15325 [1] NCCL INFO + NET[12.5] - NET/0 (90b97a0003a1420c/1/12.500000) 4c217144f0b1:15232:15325 [1] NCCL INFO + NET[12.5] - NET/1 (90b97a0003a1420c/2/12.500000) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - GPU/3D000 (1) 4c217144f0b1:15232:15326 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[22.0] - 
GPU/89000 4c217144f0b1:15232:15325 [1] NCCL INFO + SYS[9.0] - CPU/1 4c217144f0b1:15232:15325 [1] NCCL INFO CPU/1 (1/1/2) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - GPU/89000 (2) 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[22.0] - GPU/3D000 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - NIC/8A000 4c217144f0b1:15232:15325 [1] NCCL INFO + NET[12.5] - NET/2 (48bc7a0003a1420c/1/12.500000) 4c217144f0b1:15232:15325 [1] NCCL INFO + NET[12.5] - NET/3 (48bc7a0003a1420c/2/12.500000) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) 4c217144f0b1:15232:15325 [1] NCCL INFO + PCI[12.0] - GPU/B2000 (3) 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/89000 4c217144f0b1:15232:15325 [1] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15325 [1] NCCL INFO + SYS[9.0] - CPU/0 4c217144f0b1:15232:15325 [1] NCCL INFO ========================================== 4c217144f0b1:15232:15325 [1] NCCL INFO GPU/1A000 :GPU/1A000 (0/5000.000000/LOC) GPU/3D000 (1/44.000000/NVL) GPU/89000 (2/44.000000/NVB) GPU/B2000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 4c217144f0b1:15232:15325 [1] NCCL INFO GPU/3D000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (0/5000.000000/LOC) GPU/89000 (1/22.000000/NVL) GPU/B2000 (2/44.000000/NVB) CPU/0 (2/12.000000/PHB) CPU/1 (3/9.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (5/12.000000/PHB) NET/2 (6/9.000000/SYS) NET/3 (6/9.000000/SYS) 4c217144f0b1:15232:15325 [1] NCCL INFO GPU/89000 :GPU/1A000 (2/44.000000/NVB) GPU/3D000 (1/22.000000/NVL) GPU/89000 (0/5000.000000/LOC) GPU/B2000 (1/44.000000/NVL) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15325 [1] 
NCCL INFO GPU/B2000 :GPU/1A000 (1/44.000000/NVL) GPU/3D000 (2/44.000000/NVB) GPU/89000 (1/44.000000/NVL) GPU/B2000 (0/5000.000000/LOC) CPU/0 (3/9.000000/SYS) CPU/1 (2/12.000000/PHB) NET/0 (6/9.000000/SYS) NET/1 (6/9.000000/SYS) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) 4c217144f0b1:15232:15325 [1] NCCL INFO NET/0 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/12.500000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) 4c217144f0b1:15232:15325 [1] NCCL INFO NET/1 :GPU/1A000 (5/12.000000/PHB) GPU/3D000 (5/12.000000/PHB) GPU/89000 (6/9.000000/SYS) GPU/B2000 (6/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (2/12.500000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (7/9.000000/SYS) NET/3 (7/9.000000/SYS) 4c217144f0b1:15232:15325 [1] NCCL INFO NET/2 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/12.500000/LOC) 4c217144f0b1:15232:15325 [1] NCCL INFO NET/3 :GPU/1A000 (6/9.000000/SYS) GPU/3D000 (6/9.000000/SYS) GPU/89000 (5/12.000000/PHB) GPU/B2000 (5/12.000000/PHB) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (7/9.000000/SYS) NET/1 (7/9.000000/SYS) NET/2 (2/12.500000/LOC) NET/3 (0/5000.000000/LOC) 4c217144f0b1:15232:15325 [1] NCCL INFO Setting affinity for GPU 1 to 010000,00000001 e0a95b6643f4:59249:59342 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' e0a95b6643f4:59249:59342 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' 4c217144f0b1:15232:15326 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 88.0 === 4c217144f0b1:15232:15326 [0] NCCL INFO CPU/0 (1/1/2) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - PCI/18000 (10b5876410b58764) 4c217144f0b1:15232:15326 [0] NCCL 
INFO + PCI[12.0] - GPU/1A000 (0) 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/3D000 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - NIC/1C000 4c217144f0b1:15232:15326 [0] NCCL INFO + NET[12.5] - NET/0 (90b97a0003a1420c/1/12.500000) 4c217144f0b1:15232:15326 [0] NCCL INFO + NET[12.5] - NET/1 (90b97a0003a1420c/2/12.500000) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - PCI/3B000 (10b5876410b58764) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - GPU/3D000 (1) 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[22.0] - GPU/89000 4c217144f0b1:15232:15326 [0] NCCL INFO + SYS[9.0] - CPU/1 4c217144f0b1:15232:15326 [0] NCCL INFO CPU/1 (1/1/2) e0a95b6643f4:59249:59342 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - PCI/86000 (10b5876410b58764) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - GPU/89000 (2) 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/B2000 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[22.0] - GPU/3D000 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - NIC/8A000 4c217144f0b1:15232:15326 [0] NCCL INFO + NET[12.5] - NET/2 (48bc7a0003a1420c/1/12.500000) e0a95b6643f4:59249:59342 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' 4c217144f0b1:15232:15326 [0] NCCL INFO + NET[12.5] - NET/3 (48bc7a0003a1420c/2/12.500000) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - PCI/AF000 (10b5876410b58764) 4c217144f0b1:15232:15326 [0] NCCL INFO + PCI[12.0] - GPU/B2000 (3) 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/89000 4c217144f0b1:15232:15326 [0] NCCL INFO + NVL[44.0] - GPU/1A000 4c217144f0b1:15232:15326 [0] NCCL INFO + SYS[9.0] - CPU/0 4c217144f0b1:15232:15326 [0] NCCL INFO ==========================================
Hello, I want to use two non-blocking GPU streams, one for NCCL communication and one for cuMemcpyAsync, so that the two run concurrently and speed things up. GPU: V100 32GB. NCCL: version 2.13.4+cuda11.7, and I use IB. My question is: does NCCL use the same hardware as cuMemcpy (such as DMA)? Is it OK to run them together like this?
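To make the log above easier to scan, here is a minimal sketch (not part of NCCL) that pulls out, per host, which network NCCL reports using and which HCAs it reports GPU Direct RDMA as disabled for. The function name `summarize` and the regular expressions are my own assumptions, based only on the entry format visible in the log above (`host:pid:tid [gpu] NCCL INFO message`).

```python
import re
from collections import defaultdict

# Prefix of each entry as it appears in the log above:
#   <host>:<pid>:<tid> [<gpu>] NCCL INFO <message>
# Only the host is captured, so re.split() returns
#   [preamble, host, message, host, message, ...]
ENTRY_PREFIX = re.compile(r"(\S+):\d+:\d+ \[\d+\] NCCL INFO ")

# Message NCCL prints for each HCA when GPU Direct RDMA is not used.
GDR_DISABLED = re.compile(r"NET/IB : GPU Direct RDMA Disabled for HCA \d+ '([^']+)'")


def summarize(log_text):
    """Return {host: {"network": set, "gdr_disabled_hcas": set}}."""
    summary = defaultdict(lambda: {"network": set(), "gdr_disabled_hcas": set()})

    # The pasted log is wrapped at arbitrary points, so collapse all
    # whitespace before splitting on the entry prefix.
    normalized = " ".join(log_text.split())
    parts = ENTRY_PREFIX.split(normalized)

    for i in range(1, len(parts), 2):
        host, message = parts[i], parts[i + 1].strip()
        if message.startswith("Using network "):
            summary[host]["network"].add(message.split()[2])
        match = GDR_DISABLED.match(message)
        if match:
            summary[host]["gdr_disabled_hcas"].add(match.group(1))

    return dict(summary)
```

Calling `summarize(open("nccl.log").read())` on a saved copy of the log (the file name is hypothetical) returns one entry per host, which makes it quicker to confirm that both nodes report the same network and the same GPU Direct RDMA state.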