Open gim4moon opened 9 months ago
That output doesn't look correct. Did you compile the nccl-tests with MPI=1?
Once you've corrected that it would be good to see logs with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH
Hello,
Thanks for your reply!
I ran the test again with the settings you suggested; the output is below.
The results look similar.
root@testgpu1:/mpi-nfs/nccl-tests# mpirun -x MPI=1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_IB_HCA=mlx5_0 -x NCCL_NET_GDR_LEVEL=5 -x NCCL_SHM_DISABLE=1 -x NCCL_IB_MERGE_VFS=0 -x NCCL_IGNORE_DISABLED_P2P=1 -x NCCL_IB_DISABLE=0 -np 2 --allow-run-as-root -H testgpu1,testgpu2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 3255 on testgpu2 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 3255 on testgpu2 device 1 [0x03] NVIDIA A100-PCIE-40GB
Rank 0 Group 0 Pid 2832 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 2832 on testgpu1 device 1 [0x03] NVIDIA A100-PCIE-40GB
testgpu2:3255:3255 [0] NCCL INFO Bootstrap : Using ibs65:10.10.10.101<0>
testgpu2:3255:3255 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testgpu1:2832:2832 [0] NCCL INFO Bootstrap : Using ibs65:10.10.10.100<0>
testgpu1:2832:2832 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testgpu2:3255:3255 [1] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
testgpu1:2832:2832 [1] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_HCA set to mlx5_0
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_MERGE_VFS set by environment to 0.
testgpu2:3255:3262 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs65:10.10.10.101<0>
testgpu2:3255:3262 [1] NCCL INFO Using non-device net plugin version 0
testgpu2:3255:3262 [1] NCCL INFO Using network IB
testgpu2:3255:3261 [0] NCCL INFO Using non-device net plugin version 0
testgpu2:3255:3261 [0] NCCL INFO Using network IB
testgpu2:3255:3261 [0] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0xe5eea11d31e0cffc - Init START
testgpu2:3255:3262 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0xe5eea11d31e0cffc - Init START
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_HCA set to mlx5_0
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_MERGE_VFS set by environment to 0.
testgpu1:2832:2839 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs65:10.10.10.100<0>
testgpu1:2832:2839 [1] NCCL INFO Using non-device net plugin version 0
testgpu1:2832:2839 [1] NCCL INFO Using network IB
testgpu1:2832:2838 [0] NCCL INFO Using non-device net plugin version 0
testgpu1:2832:2838 [0] NCCL INFO Using network IB
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu1:2832:2838 [0] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x5c31d9a008352187 - Init START
testgpu1:2832:2839 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0x5c31d9a008352187 - Init START
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
testgpu2:3255:3262 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
testgpu2:3255:3262 [1] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
testgpu2:3255:3262 [1] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu2:3255:3262 [1] NCCL INFO CPU/0 (1/2/-1)
testgpu2:3255:3262 [1] NCCL INFO + PCI[12.0] - NIC/3010
testgpu2:3255:3262 [1] NCCL INFO + NET[25.0] - NET/0 (c2ed3f0003ebc008/1/25.000000)
testgpu2:3255:3262 [1] NCCL INFO + PCI[12.0] - GPU/3020 (1)
testgpu2:3255:3262 [1] NCCL INFO ==========================================
testgpu2:3255:3262 [1] NCCL INFO GPU/3020 :GPU/3020 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu2:3255:3262 [1] NCCL INFO NET/0 :GPU/3020 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu2:3255:3262 [1] NCCL INFO Setting affinity for GPU 1 to 01
testgpu2:3255:3262 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3262 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu2:3255:3262 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3262 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu2:3255:3261 [0] NCCL INFO CPU/0 (1/2/-1)
testgpu2:3255:3261 [0] NCCL INFO + PCI[12.0] - GPU/3000 (0)
testgpu2:3255:3261 [0] NCCL INFO + PCI[12.0] - NIC/3010
testgpu2:3255:3261 [0] NCCL INFO + NET[25.0] - NET/0 (c2ed3f0003ebc008/1/25.000000)
testgpu2:3255:3261 [0] NCCL INFO ==========================================
testgpu2:3255:3261 [0] NCCL INFO GPU/3000 :GPU/3000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu2:3255:3261 [0] NCCL INFO NET/0 :GPU/3000 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu2:3255:3261 [0] NCCL INFO Setting affinity for GPU 0 to 01
testgpu2:3255:3261 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3261 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu2:3255:3261 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3261 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu2:3255:3262 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
testgpu2:3255:3262 [1] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
testgpu2:3255:3262 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0
testgpu2:3255:3262 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0
testgpu2:3255:3262 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
testgpu2:3255:3262 [1] NCCL INFO P2P Chunksize set to 131072
testgpu2:3255:3261 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testgpu2:3255:3261 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
testgpu2:3255:3261 [0] NCCL INFO Channel 00/02 : 0 1
testgpu2:3255:3261 [0] NCCL INFO Channel 01/02 : 0 1
testgpu2:3255:3261 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
testgpu2:3255:3261 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
testgpu2:3255:3261 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
testgpu2:3255:3261 [0] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Connected all rings
testgpu2:3255:3262 [1] NCCL INFO Connected all trees
testgpu2:3255:3262 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu2:3255:3262 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu2:3255:3261 [0] NCCL INFO Connected all rings
testgpu2:3255:3261 [0] NCCL INFO Connected all trees
testgpu2:3255:3261 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu2:3255:3261 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
testgpu1:2832:2839 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
testgpu1:2832:2839 [1] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
testgpu1:2832:2839 [1] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu1:2832:2839 [1] NCCL INFO CPU/0 (1/2/-1)
testgpu1:2832:2839 [1] NCCL INFO + PCI[12.0] - NIC/3010
testgpu1:2832:2839 [1] NCCL INFO + NET[25.0] - NET/0 (e6ea3f0003ebc008/1/25.000000)
testgpu1:2832:2839 [1] NCCL INFO + PCI[12.0] - GPU/3020 (1)
testgpu1:2832:2839 [1] NCCL INFO ==========================================
testgpu1:2832:2839 [1] NCCL INFO GPU/3020 :GPU/3020 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu1:2832:2839 [1] NCCL INFO NET/0 :GPU/3020 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu1:2832:2839 [1] NCCL INFO Setting affinity for GPU 1 to 01
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2839 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu1:2832:2839 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu1:2832:2838 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu1:2832:2838 [0] NCCL INFO CPU/0 (1/2/-1)
testgpu1:2832:2838 [0] NCCL INFO + PCI[12.0] - GPU/3000 (0)
testgpu1:2832:2838 [0] NCCL INFO + PCI[12.0] - NIC/3010
testgpu1:2832:2838 [0] NCCL INFO + NET[25.0] - NET/0 (e6ea3f0003ebc008/1/25.000000)
testgpu1:2832:2838 [0] NCCL INFO ==========================================
testgpu1:2832:2838 [0] NCCL INFO GPU/3000 :GPU/3000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu1:2832:2838 [0] NCCL INFO NET/0 :GPU/3000 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu1:2832:2838 [0] NCCL INFO Setting affinity for GPU 0 to 01
testgpu1:2832:2838 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu1:2832:2838 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu1:2832:2838 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testgpu1:2832:2838 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0
testgpu1:2832:2839 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0
testgpu1:2832:2839 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
testgpu1:2832:2839 [1] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Channel 00/02 : 0 1
testgpu1:2832:2838 [0] NCCL INFO Channel 01/02 : 0 1
testgpu1:2832:2838 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
testgpu1:2832:2838 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
testgpu1:2832:2838 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
testgpu1:2832:2838 [0] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0xe5eea11d31e0cffc - Init COMPLETE
testgpu2:3255:3261 [0] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0xe5eea11d31e0cffc - Init COMPLETE
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
testgpu1:2832:2839 [1] NCCL INFO Connected all rings
testgpu1:2832:2839 [1] NCCL INFO Connected all trees
testgpu1:2832:2838 [0] NCCL INFO Connected all rings
testgpu1:2832:2838 [0] NCCL INFO Connected all trees
testgpu1:2832:2839 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu1:2832:2839 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2838 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu1:2832:2838 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2838 [0] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x5c31d9a008352187 - Init COMPLETE
testgpu1:2832:2839 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0x5c31d9a008352187 - Init COMPLETE
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 40.64 0.00 0.00 0 45.02 0.00 0.00 0
16 4 float sum -1 49.66 0.00 0.00 0 48.21 0.00 0.00 0
32 8 float sum -1 47.28 0.00 0.00 0 48.67 0.00 0.00 0
64 16 float sum -1 47.85 0.00 0.00 0 47.38 0.00 0.00 0
128 32 float sum -1 47.76 0.00 0.00 0 48.64 0.00 0.00 0
256 64 float sum -1 48.23 0.01 0.01 0 47.77 0.01 0.01 0
8 2 float sum -1 46.80 0.00 0.00 0 56.22 0.00 0.00 0
512 128 float sum -1 47.44 0.01 0.01 0 48.22 0.01 0.01 0
16 4 float sum -1 59.49 0.00 0.00 0 57.71 0.00 0.00 0
1024 256 float sum -1 47.44 0.02 0.02 0 53.56 0.02 0.02 0
32 8 float sum -1 56.84 0.00 0.00 0 58.08 0.00 0.00 0
2048 512 float sum -1 50.01 0.04 0.04 0 48.32 0.04 0.04 0
64 16 float sum -1 57.87 0.00 0.00 0 57.77 0.00 0.00 0
4096 1024 float sum -1 47.37 0.09 0.09 0 47.15 0.09 0.09 0
128 32 float sum -1 58.60 0.00 0.00 0 58.63 0.00 0.00 0
8192 2048 float sum -1 47.34 0.17 0.17 0 48.44 0.17 0.17 0
256 64 float sum -1 59.60 0.00 0.00 0 58.33 0.00 0.00 0
16384 4096 float sum -1 55.38 0.30 0.30 0 55.06 0.30 0.30 0
512 128 float sum -1 56.52 0.01 0.01 0 59.72 0.01 0.01 0
1024 256 float sum -1 56.78 0.02 0.02 0 58.03 0.02 0.02 0
32768 8192 float sum -1 77.52 0.42 0.42 0 82.21 0.40 0.40 0
2048 512 float sum -1 58.29 0.04 0.04 0 56.97 0.04 0.04 0
4096 1024 float sum -1 57.71 0.07 0.07 0 58.00 0.07 0.07 0
8192 2048 float sum -1 60.02 0.14 0.14 0 59.03 0.14 0.14 0
16384 4096 float sum -1 80.93 0.20 0.20 0 81.88 0.20 0.20 0
65536 16384 float sum -1 156.4 0.42 0.42 0 157.0 0.42 0.42 0
32768 8192 float sum -1 113.1 0.29 0.29 0 111.1 0.29 0.29 0
131072 32768 float sum -1 306.9 0.43 0.43 0 311.4 0.42 0.42 0
65536 16384 float sum -1 202.6 0.32 0.32 0 206.6 0.32 0.32 0
131072 32768 float sum -1 379.7 0.35 0.35 0 358.5 0.37 0.37 0
262144 65536 float sum -1 139.4 1.88 1.88 0 139.7 1.88 1.88 0
524288 131072 float sum -1 210.6 2.49 2.49 0 201.7 2.60 2.60 0
262144 65536 float sum -1 148.8 1.76 1.76 0 153.8 1.70 1.70 0
1048576 262144 float sum -1 325.3 3.22 3.22 0 338.9 3.09 3.09 0
524288 131072 float sum -1 212.0 2.47 2.47 0 217.1 2.41 2.41 0
2097152 524288 float sum -1 565.9 3.71 3.71 0 556.9 3.77 3.77 0
1048576 262144 float sum -1 339.0 3.09 3.09 0 307.6 3.41 3.41 0
4194304 1048576 float sum -1 1082.5 3.87 3.87 0 1023.0 4.10 4.10 0
2097152 524288 float sum -1 557.0 3.77 3.77 0 536.7 3.91 3.91 0
4194304 1048576 float sum -1 1019.8 4.11 4.11 0 942.2 4.45 4.45 0
8388608 2097152 float sum -1 2001.8 4.19 4.19 0 1978.8 4.24 4.24 0
8388608 2097152 float sum -1 1869.2 4.49 4.49 0 1814.3 4.62 4.62 0
16777216 4194304 float sum -1 3863.2 4.34 4.34 0 3865.6 4.34 4.34 0
16777216 4194304 float sum -1 3564.1 4.71 4.71 0 3552.0 4.72 4.72 0
33554432 8388608 float sum -1 7574.1 4.43 4.43 0 7581.0 4.43 4.43 0
33554432 8388608 float sum -1 7022.9 4.78 4.78 0 7030.6 4.77 4.77 0
67108864 16777216 float sum -1 14979 4.48 4.48 0 14999 4.47 4.47 0
67108864 16777216 float sum -1 13870 4.84 4.84 0 13872 4.84 4.84 0
134217728 33554432 float sum -1 29770 4.51 4.51 0 29873 4.49 4.49 0
134217728 33554432 float sum -1 27666 4.85 4.85 0 27662 4.85 4.85 0
268435456 67108864 float sum -1 55270 4.86 4.86 0 55213 4.86 4.86 0
testgpu1:2832:2832 [1] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 busId 3000 - Destroy COMPLETE
testgpu1:2832:2832 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 busId 3020 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 1.7535
268435456 67108864 float sum -1 59440 4.52 4.52 0 59568 4.51 4.51 0
testgpu2:3255:3255 [1] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 busId 3000 - Destroy COMPLETE
testgpu2:3255:3255 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 busId 3020 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 1.6796
The comment said to compile the nccl-tests with MPI=1, not to run with it. So please run make clean, then run make again with MPI=1, as instructed in the README (you should probably read it again, it's not long).
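For reference, the rebuild described above might look like the sketch below. The make variables `MPI=1`, `MPI_HOME`, `CUDA_HOME`, and `NCCL_HOME` are the ones listed in the nccl-tests README; the function name `rebuild_with_mpi` and the way the paths are passed in are only illustrative and not part of nccl-tests.

```shell
# Rebuild nccl-tests with MPI support. Run this from the nccl-tests checkout.
# Arguments (all placeholders for local install prefixes):
#   $1 = MPI install prefix, $2 = CUDA install prefix, $3 = NCCL install prefix
rebuild_with_mpi() {
    # Remove objects from the earlier non-MPI build first.
    make clean &&
    # MPI=1 is a build-time option; exporting it through mpirun -x has no effect.
    make MPI=1 MPI_HOME="$1" CUDA_HOME="$2" NCCL_HOME="$3"
}
```

It would be called as `rebuild_with_mpi /path/to/mpi /path/to/cuda /path/to/nccl`. With an MPI-enabled binary, the two mpirun processes should join one job, so the header would be expected to list ranks 0 to 3 across both hosts and print a single results table instead of two independent ones.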
NCCL_HOME=/path/to/nccl: in which directory is NCCL usually installed?
I installed libnccl2 and libnccl-dev, but I can't find the directories.
root@testgpu1:/nccl-tests# mpirun -x NCCL_DEBUG=WARN -x NCCL_IB_HCA=mlx5_0 -x NCCL_NET_GDR_LEVEL=5 -x NCCL_SHM_DISABLE=1 -x NCCL_IB_MERGE_VFS=0 -x NCCL_IB_DISABLE=0 -np 2 --allow-run-as-root -H testgpu1,testgpu2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 3591 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 3591 on testgpu1 device 1 [0x03] NVIDIA A100-PCIE-40GB
Rank 0 Group 0 Pid 2451 on testgpu2 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 2451 on testgpu2 device 1 [0x03] NVIDIA A100-PCIE-40GB
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
134217728 33554432 float sum -1 27691 4.85 4.85 0 27634 4.86 4.86 0
134217728 33554432 float sum -1 29708 4.52 4.52 0 29694 4.52 4.52 0
268435456 67108864 float sum -1 55225 4.86 4.86 0 55242 4.86 4.86 0
Out of bounds values : 0 OK
Avg bus bandwidth : 1.80135
268435456 67108864 float sum -1 59067 4.54 4.54 0 59082 4.54 4.54 0
Out of bounds values : 0 OK
Avg bus bandwidth : 1.65866
root@testgpu1:/nccl-tests#
We are testing with each virtual machine configured with two A100 GPUs and one HDR InfiniBand adapter, all passed through. I'm asking because the speed is not as fast as I expected.
Am I missing something?
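As a reading aid for the tables above: nccl-tests computes algbw as message size divided by time, and for all-reduce it reports busbw as algbw multiplied by 2(n-1)/n, where n is the number of ranks. Each process in these runs reports nranks 2, so the factor is 1 and the two columns are identical. A minimal sketch of that arithmetic with awk follows; the function names are made up for illustration.

```shell
# algbw in GB/s: bytes / microseconds / 1000 (with 1 GB = 1e9 bytes).
algbw_gbs() {
    awk -v bytes="$1" -v us="$2" 'BEGIN { printf "%.2f\n", bytes / us / 1000 }'
}

# All-reduce busbw in GB/s: algbw * 2 * (n - 1) / n for n ranks.
allreduce_busbw_gbs() {
    awk -v bytes="$1" -v us="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", bytes / us / 1000 * 2 * (n - 1) / n }'
}

# Largest message from the first run: 268435456 bytes in 55270 us, 2 ranks.
algbw_gbs 268435456 55270
allreduce_busbw_gbs 268435456 55270 2
```

Both calls give 4.86, matching the last row printed by testgpu1. For comparison, HDR InfiniBand is nominally 200 Gb/s, i.e. 25 GB/s, which agrees with the NET[25.0] entry in the topology dump, while the same dump shows PCI[12.0] for the GPU and NIC links.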