Open lcw2 opened 10 months ago
It would be helpful to provide the topology dump (set NCCL_TOPO_DUMP_FILE=system.txt
then post the generated file).
It's still not clear to me whether the peak theoretical bandwidth is 100 GB/s or 200 GB/s (the NICs are dual port, but are they PCIe Gen4 or Gen5?).
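For reference, the dump requested above could be produced by running the test with the variable set. The following is only a minimal sketch: the binary path and flags are copied from the single-node command in this thread, and the helper name `topo_dump_command` is hypothetical.

```python
import os


def topo_dump_command(dump_file="system.txt", base_env=None):
    """Return (argv, env) for an all_reduce_perf run that also requests a topology dump.

    Hypothetical helper: it only builds the command and environment, it does not run anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL reads this variable and writes the topology it detected to the given path.
    env["NCCL_TOPO_DUMP_FILE"] = dump_file
    # Same invocation as the single-node test quoted in this thread.
    argv = ["./build/all_reduce_perf", "-b", "8", "-e", "4G", "-f", "2", "-g", "8"]
    return argv, env


# To actually launch it (requires the nccl-tests binary and GPUs):
#   import subprocess
#   argv, env = topo_dump_command()
#   subprocess.run(argv, env=env, check=True)
```

After the run, the file named in `dump_file` is what would be attached to the issue.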
system.txt: system.txt PCIe:
I am also curious why the PCIe Gen4 bandwidth is considered to be 24 GB/s. @sjeaugey
> why is the PCIe 4 bandwidth considered to be 24GB/s.
Why not? That's what we see, just like Gen3 can reach 12 GB/s and Gen5 can reach 48 GB/s. Maybe a bit more than that, but not much, so those numbers are the ones we aim for.
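To put those numbers in context, the sketch below compares them with the raw line rate of a PCIe link. It assumes x16 links (the thread does not state the link width) and accounts only for the 128b/130b encoding used by Gen3 through Gen5; the remaining gap is commonly attributed to packet and protocol overhead, which is not modelled here.

```python
# Raw per-lane signalling rates in GT/s for PCIe Gen3/4/5 (all use 128b/130b encoding).
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
# Achievable throughput figures quoted in this thread, in GB/s.
QUOTED_GBPS = {3: 12.0, 4: 24.0, 5: 48.0}


def raw_x16_gbps(gen, lanes=16):
    """Line-rate bandwidth in GB/s, one direction, after 128b/130b encoding only."""
    return GT_PER_LANE[gen] * lanes * (128 / 130) / 8


for gen in (3, 4, 5):
    raw = raw_x16_gbps(gen)
    quoted = QUOTED_GBPS[gen]
    print(f"Gen{gen} x16: raw {raw:.1f} GB/s, quoted {quoted:.0f} GB/s "
          f"({quoted / raw:.0%} of raw)")
```

Under these assumptions the quoted figures come out at roughly three quarters of the raw x16 rate for every generation.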
Hello, we are trying to test all-reduce performance on two A100 nodes. Each node is configured as follows: 8 x A100, 4 x IB 200 Gb NICs (dual port).
We tested three scenarios:
1. ib_write_bw between the NICs of the two nodes
2. Single node: ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8, result: 232.54 GB/s
3. NCCL all-reduce between two nodes: only 86 GB/s. There is a big gap compared to the theoretical value (100 GB/s). Why is this? Log: test.log
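The arithmetic behind the 100 GB/s and 200 GB/s figures discussed in this thread can be sketched as follows. The NIC count, port speed and measured result come from the post. The assumption that each NIC sits behind its own PCIe Gen4 link limited to about 24 GB/s is not confirmed anywhere in the thread and is labelled as such in the code.

```python
NICS_PER_NODE = 4        # from the post
PORT_GBIT = 200          # Gb/s per port, from the post
MEASURED_GBPS = 86.0     # two-node all-reduce result from the post
# ASSUMPTION: one PCIe Gen4 link per NIC, at the ~24 GB/s figure quoted in this thread.
PCIE_GEN4_GBPS = 24.0


def node_network_gbps(ports_used_per_nic):
    """Aggregate NIC line rate per node in GB/s (8 bits per byte)."""
    return NICS_PER_NODE * ports_used_per_nic * PORT_GBIT / 8


one_port = node_network_gbps(1)              # one port per NIC in use
two_ports = node_network_gbps(2)             # both ports of every NIC in use
pcie_cap = NICS_PER_NODE * PCIE_GEN4_GBPS    # ceiling if the NICs are on Gen4

print(f"line rate, one port per NIC : {one_port:.0f} GB/s")
print(f"line rate, two ports per NIC: {two_ports:.0f} GB/s")
print(f"PCIe Gen4 ceiling (assumed) : {pcie_cap:.0f} GB/s")
print(f"measured / one-port rate    : {MEASURED_GBPS / one_port:.0%}")
print(f"measured / PCIe ceiling     : {MEASURED_GBPS / pcie_cap:.0%}")
```

If the NICs are in fact on Gen4, the PCIe ceiling rather than the port count would bound the result, so using the second port of each NIC would not raise the achievable bandwidth under this assumption.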