Two A100 nodes cannot reach ideal all-reduce performance

NVIDIA / nccl-tests

Two A100 nodes cannot reach ideal all-reduce performance #175

Open lcw2 opened 10 months ago

lcw2 commented 10 months ago

Hello, we are trying to test all-reduce performance across two A100 nodes. Each node is configured as follows: A100 x 8, IB 200Gb x 4 (dual port).

We tested three scenarios:

ib_write_bw between the NICs of the two nodes
Single node, ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8, result: 232.54 GB/s
NCCL all-reduce between the two nodes reaches only 86 GB/s. There is a big gap compared to the theoretical value (100 GB/s). Why is this? Log: test.log
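For readers following the numbers, here is a minimal sketch of where the 100 GB/s figure presumably comes from. It assumes that "theoretical value" means four NICs per node, each counted as a single 200 Gb/s link with no protocol overhead. That reading is an assumption, not something the post states.

```python
# Line-rate arithmetic for the configuration described above.
# Assumption: one 200 Gb/s link is counted per NIC, four NICs per node.
NICS_PER_NODE = 4
LINK_GBIT_PER_S = 200  # per NIC, in gigabits per second


def line_rate_gbytes(nics: int, gbit_per_nic: float) -> float:
    """Aggregate line rate in GB/s (8 bits per byte, no protocol overhead)."""
    return nics * gbit_per_nic / 8


theoretical = line_rate_gbytes(NICS_PER_NODE, LINK_GBIT_PER_S)
measured = 86.0  # GB/s, the two-node all-reduce result reported above
efficiency = measured / theoretical

print(f"theoretical: {theoretical:.0f} GB/s")
print(f"measured:    {measured:.0f} GB/s ({efficiency:.0%} of line rate)")
```

Under this reading, the reported 86 GB/s is 86% of the raw line rate.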

sjeaugey commented 10 months ago

It would be helpful to provide the topology dump (set NCCL_TOPO_DUMP_FILE=system.txt then post the generated file).
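One way to produce that dump is sketched below. It reuses the single-node command line from the original post purely as an illustration. How the variable is forwarded to the ranks in a two-node run depends on the launcher, which the thread does not show.

```python
import os
import subprocess

# NCCL_TOPO_DUMP_FILE is the variable named in the comment above.
# The command reuses the single-node invocation from the original post.
env = dict(os.environ, NCCL_TOPO_DUMP_FILE="system.txt")
cmd = ["./build/all_reduce_perf", "-b", "8", "-e", "4G", "-f", "2", "-g", "8"]

# Launch only if the binary is present and executable; otherwise show the command.
if os.path.isfile(cmd[0]) and os.access(cmd[0], os.X_OK):
    subprocess.run(cmd, env=env, check=True)
else:
    print("would run:", "NCCL_TOPO_DUMP_FILE=system.txt", " ".join(cmd))
```

After a successful run, system.txt should appear in the working directory and can be attached to the issue.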

It's still not clear to me whether the peak theoretical is 100GB/s or 200GB/s (as NICs are dual port -- but are they Gen4 or Gen5?).

lcw2 commented 10 months ago

system.txt: system.txt PCIe:

lcw2 commented 10 months ago

I am also curious why the PCIe Gen4 bandwidth is considered to be 24 GB/s. @sjeaugey

sjeaugey commented 10 months ago

why is the PCIe 4 bandwidth considered to be 24GB/s.

Why not? That's what we see, just like Gen3 can reach 12 GB/s and Gen5 can reach 48 GB/s. Maybe a bit more than that, but not much, so those numbers are the ones we aim for.
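To see how these PCIe figures interact with the 100 GB/s versus 200 GB/s question, here is a rough sketch that caps each NIC at the smaller of its network line rate and the PCIe rate quoted above. It assumes each of the four NICs has its own dedicated PCIe link at the quoted rate and that ports run at 200 Gb/s. Neither is confirmed in the thread, and `nic_ceiling` is a hypothetical helper.

```python
# Achievable PCIe bandwidth per generation, in GB/s, as quoted in the reply above.
PCIE_GBYTES = {"Gen3": 12, "Gen4": 24, "Gen5": 48}


def nic_ceiling(pcie_gen: str, ports: int, gbit_per_port: float) -> float:
    """Per-NIC ceiling in GB/s: the smaller of network line rate and PCIe rate."""
    network = ports * gbit_per_port / 8  # convert gigabits to gigabytes
    return min(network, PCIE_GBYTES[pcie_gen])


NICS = 4  # NICs per node, from the original post
for gen in ("Gen4", "Gen5"):
    for ports in (1, 2):
        total = NICS * nic_ceiling(gen, ports, 200)
        print(f"{gen}, {ports} port(s) per NIC: {total:.0f} GB/s per node")
```

Under these assumptions, a Gen4 system would top out near 96 GB/s per node regardless of the second port, which would put the reported 86 GB/s at about 90% of that ceiling. A Gen5 system could approach 192 GB/s with both ports in use.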