Open TeddLi opened 8 months ago
I also passed the local NCCL test:
root@g3-xlarge-x86-dal-1:/home/ubuntu/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 0 [0x01] NVIDIA H100 PCIe
# Rank 1 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 1 [0x21] NVIDIA H100 PCIe
# Rank 2 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 2 [0x41] NVIDIA H100 PCIe
# Rank 3 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 3 [0x61] NVIDIA H100 PCIe
# Rank 4 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 4 [0x81] NVIDIA H100 PCIe
# Rank 5 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 5 [0xa1] NVIDIA H100 PCIe
# Rank 6 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 6 [0xc1] NVIDIA H100 PCIe
# Rank 7 Group 0 Pid 3096 on g3-xlarge-x86-dal-1 device 7 [0xe1] NVIDIA H100 PCIe
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 33.98 0.00 0.00 0 23.26 0.00 0.00 0
16 4 float sum -1 23.09 0.00 0.00 0 23.29 0.00 0.00 0
32 8 float sum -1 23.10 0.00 0.00 0 23.26 0.00 0.00 0
64 16 float sum -1 23.21 0.00 0.00 0 23.53 0.00 0.00 0
128 32 float sum -1 23.43 0.01 0.01 0 23.08 0.01 0.01 0
256 64 float sum -1 23.44 0.01 0.02 0 23.28 0.01 0.02 0
512 128 float sum -1 26.47 0.02 0.03 0 24.22 0.02 0.04 0
1024 256 float sum -1 27.07 0.04 0.07 0 23.27 0.04 0.08 0
2048 512 float sum -1 23.40 0.09 0.15 0 23.62 0.09 0.15 0
4096 1024 float sum -1 23.64 0.17 0.30 0 23.31 0.18 0.31 0
8192 2048 float sum -1 24.08 0.34 0.60 0 23.68 0.35 0.61 0
16384 4096 float sum -1 23.66 0.69 1.21 0 23.90 0.69 1.20 0
32768 8192 float sum -1 24.50 1.34 2.34 0 23.69 1.38 2.42 0
65536 16384 float sum -1 24.93 2.63 4.60 0 25.00 2.62 4.59 0
131072 32768 float sum -1 31.99 4.10 7.17 0 29.86 4.39 7.68 0
262144 65536 float sum -1 88.35 2.97 5.19 0 87.11 3.01 5.27 0
524288 131072 float sum -1 105.4 4.97 8.70 0 110.2 4.76 8.32 0
1048576 262144 float sum -1 108.9 9.63 16.85 0 109.7 9.56 16.72 0
2097152 524288 float sum -1 155.5 13.49 23.61 0 170.1 12.33 21.57 0
4194304 1048576 float sum -1 301.8 13.90 24.32 0 300.8 13.94 24.40 0
8388608 2097152 float sum -1 617.3 13.59 23.78 0 605.7 13.85 24.24 0
16777216 4194304 float sum -1 1246.0 13.46 23.56 0 1225.7 13.69 23.95 0
33554432 8388608 float sum -1 2538.0 13.22 23.14 0 2544.2 13.19 23.08 0
67108864 16777216 float sum -1 5056.2 13.27 23.23 0 5056.2 13.27 23.23 0
134217728 33554432 float sum -1 10089 13.30 23.28 0 10108 13.28 23.24 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.46603
#
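For reading the table above: nccl-tests reports algbw as message size divided by time, and for all-reduce it derives busbw by multiplying algbw by 2(n-1)/n, where n is the number of ranks. A minimal sketch that reproduces the last row of the output (the function name is made up for illustration):

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce.

    Follows the nccl-tests convention: algbw = size / time,
    busbw = algbw * 2 * (n - 1) / n. Hypothetical helper, not part of nccl-tests.
    """
    # bytes per microsecond equals MB/s * 1e0, i.e. 1e6 B/s; divide by 1e3 for GB/s
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Last out-of-place row: 134217728 bytes in 10089 us across 8 GPUs
algbw, busbw = allreduce_bandwidth(134217728, 10089, 8)
print(f"algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")
```

With 8 ranks the factor is 1.75, which is why the busbw column sits at roughly 23 GB/s while algbw is roughly 13 GB/s.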
@peterhj @Flamefire @aaronp24 @chr1sj0nes
Are you sure your NCCL environment is the same on both runs? Perhaps compare the log of the two runs? In particular, it seems NCCL_SOCKET_IFNAME is set to eno in the PyTorch run, leading to NCCL using both eno1 and eno2:
g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO NET/Socket : Using [0]eno1:160.202.129.119<0> [1]eno2:10.87.1.117<0>
Is that what you were using with the NCCL perf tests to get 23 GB/s?
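To illustrate why a value of eno picks up both interfaces: per the NCCL documentation, NCCL_SOCKET_IFNAME is a list of prefixes by default, a leading = requests an exact match, and a leading ^ excludes the listed names. The sketch below is a simplified model of those rules, not NCCL's actual code; the function name and the interface list are assumptions for illustration.

```python
def select_interfaces(spec, interfaces):
    """Simplified model of NCCL_SOCKET_IFNAME filtering (illustrative only).

    "eno"    -> every interface whose name starts with "eno"
    "=eno1"  -> only the interface named exactly "eno1"
    "^lo"    -> every interface whose name does not start with "lo"
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matches normally, keep non-matches when the spec is an exclusion
    return [name for name in interfaces if matches(name) != exclude]


host_interfaces = ["lo", "eno1", "eno2"]  # assumed list for this machine
print(select_interfaces("eno", host_interfaces))    # prefix match
print(select_interfaces("=eno1", host_interfaces))  # exact match
```

If only one interface should be used, setting the variable to an exact name (for example =eno1) before launching is one option; which interface is the right one depends on the machine's network setup.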
Hi there, I tried to run this test (https://github.com/pytorch/examples/tree/main/distributed/FSDP) to check whether my CUDA setup and GPUs work fine. I disabled both ACS and IOMMU, but the process always hangs. Ctrl + C won't kill it, so I have to restart the server every time.
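Since the hang forces a reboot, it may help to have NCCL write its log to a file so the two runs can be compared afterwards, as suggested in the reply above. A minimal sketch, assuming the training script is started from Python with a customized environment; NCCL_DEBUG and NCCL_DEBUG_FILE are documented NCCL variables, while the helper name and log path are made up for illustration.

```python
import os


def nccl_debug_env(base_env, log_path="/tmp/nccl_debug.%h.%p.log"):
    """Return a copy of base_env with NCCL logging directed to a file.

    NCCL expands %h to the hostname and %p to the process ID in
    NCCL_DEBUG_FILE, so each rank writes to its own file.
    Hypothetical helper; the default path is an assumption.
    """
    env = dict(base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_FILE"] = log_path
    return env


env = nccl_debug_env(os.environ)
print(env["NCCL_DEBUG"], env["NCCL_DEBUG_FILE"])
# Pass env=env to subprocess.run(...) when launching the training command.
```

Running both the nccl-tests binary and the PyTorch job with the same settings gives two sets of logs whose NET/Socket and interface lines can be compared directly.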