NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

What does algbw actually mean when I run the test on more than one node? Speed between nodes or speed between GPUs? #160

Closed wenjunlong closed 1 year ago

sjeaugey commented 1 year ago

On 2 nodes, it depends which algorithm is used. If we use tree, the performance will be limited by a mix of NVLink and NIC bandwidth so it's not reflecting physical speeds. If we use ring, the busbw is going to reflect exactly what's going through the NICs.
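For reference, here is a minimal sketch of how the two bandwidth columns relate for allreduce. It assumes the definitions documented for nccl-tests (algbw is data size divided by time, and busbw is algbw scaled by 2*(n-1)/n, where n is the number of ranks). The function names are hypothetical and not part of nccl-tests.

```python
def algbw_gbs(size_bytes: int, time_s: float) -> float:
    """Algorithm bandwidth in GB/s: bytes processed divided by elapsed time."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return size_bytes / time_s / 1e9


def allreduce_busbw_gbs(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for allreduce, using the 2*(n-1)/n correction factor.

    The factor makes the number comparable to hardware link speed
    regardless of how many ranks take part in the collective.
    """
    if n_ranks < 2:
        raise ValueError("allreduce busbw needs at least 2 ranks")
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical numbers: 24 ranks (e.g. 3 nodes x 8 GPUs), algbw of 48 GB/s.
    print(allreduce_busbw_gbs(48.0, 24))
```

With many ranks the factor approaches 2, so busbw for allreduce ends up close to twice algbw.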

On 3+ nodes we should use rings for large enough sizes, so the 80 GB/s probably reflects the total NIC bandwidth, hence 20 GB/s per NIC, which is not 24 (what we generally see ourselves on perfect setups) but not too bad either. Also note that you need to use large enough sizes to see the real peak bandwidth. You didn't share the NCCL perf tests output, so I can't tell whether you've reached the peak or not. If you do share it, please run from 8 bytes to 8 GB (-b 8 -e 8G -f 2).
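The arithmetic above can be sketched as follows. The NIC count of 4 per node is an assumption inferred from 80 / 20, and the sweep assumes that the `G` suffix means 1024^3 bytes; both helper names are hypothetical.

```python
def per_nic_bandwidth(total_gbs: float, nics_per_node: int) -> float:
    """Split the aggregate ring busbw evenly across the NICs of one node."""
    if nics_per_node < 1:
        raise ValueError("nics_per_node must be at least 1")
    return total_gbs / nics_per_node


def sweep_sizes(min_bytes: int = 8, max_bytes: int = 8 * 1024**3, factor: int = 2):
    """Message sizes visited by a sweep like -b 8 -e 8G -f 2."""
    if min_bytes < 1 or factor < 2:
        raise ValueError("min_bytes must be >= 1 and factor >= 2")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    per_nic = per_nic_bandwidth(80.0, 4)  # assumed 4 NICs per node
    print(f"per NIC: {per_nic} GB/s, {per_nic / 24.0:.0%} of the 24 GB/s reference")
    sizes = sweep_sizes()
    print(f"{len(sizes)} sizes from {sizes[0]} to {sizes[-1]} bytes")
```

Peak bandwidth normally shows up only in the last rows of such a sweep, which is why the full range is worth running.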

wenjunlong commented 1 year ago

thanks

sjeaugey commented 1 year ago

Ok, thanks. So indeed the peak BW is a bit lower than I'd expect. It may be worth checking with the networking support team that the NIC firmware is upgraded to the latest version and correctly tuned.

But before that, can you run with a single NIC (NCCL_IB_HCA=mlx5_0 then mlx5_1, etc ...) and see what performance you get with each NIC individually?
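One way to organize the per-NIC runs is to generate one command line per HCA and launch them one at a time. This is only a sketch: the list of four HCA names, the binary path, and the optional launcher prefix are assumptions, and how the environment variable reaches the remote ranks depends on your launcher.

```python
import shlex


def single_nic_commands(hcas, binary="./build/all_reduce_perf", launcher=()):
    """Build one shell command per HCA, restricting NCCL to that NIC.

    launcher is an optional argv prefix (for example an MPI launcher and
    its options); it is left empty here because it is site specific.
    """
    commands = []
    for hca in hcas:
        argv = [*launcher, binary, "-b", "8", "-e", "8G", "-f", "2"]
        commands.append(f"NCCL_IB_HCA={shlex.quote(hca)} {shlex.join(argv)}")
    return commands


if __name__ == "__main__":
    # Assumed HCA names; check the actual device names on your nodes.
    for cmd in single_nic_commands(["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]):
        print(cmd)
```

Comparing the peak busbw of each run should show whether one NIC is slower than the others or whether all of them sit below the expected figure.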