thanks
Ok, thanks. So indeed the peak BW is a bit lower than I'd expect. It may be worth checking with the networking support team that the NIC firmware is upgraded to the latest version and that the NICs are correctly tuned.
But before that, can you run with a single NIC (`NCCL_IB_HCA=mlx5_0`, then `mlx5_1`, etc.) and see what performance you get with each NIC individually?
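To make the per-NIC runs less error-prone, here is a minimal sketch that only prints one command line per NIC. It makes several assumptions: four NICs named `mlx5_0` to `mlx5_3`, the nccl-tests binary at `./build/all_reduce_perf`, and the helper name `single_nic_commands` (hypothetical). It leaves out the multi-node launcher (mpirun, srun, ...), since that depends on your cluster.

```python
import shlex

# Assumed path to the nccl-tests all-reduce binary; adjust to your build.
DEFAULT_BINARY = "./build/all_reduce_perf"


def single_nic_commands(nics, binary=DEFAULT_BINARY):
    """Return one shell command line per NIC, each restricting NCCL to that NIC."""
    # Size sweep from 8 bytes to 8G, doubling at each step.
    args = ["-b", "8", "-e", "8G", "-f", "2"]
    commands = []
    for nic in nics:
        # NCCL_IB_HCA selects which InfiniBand device NCCL is allowed to use.
        commands.append(f"NCCL_IB_HCA={nic} " + shlex.join([binary, *args]))
    return commands


if __name__ == "__main__":
    # Assumption: four NICs named mlx5_0 .. mlx5_3.
    for line in single_nic_commands([f"mlx5_{i}" for i in range(4)]):
        print(line)
```

Prefix each printed line with whatever launcher you normally use, and compare the peak busbw reported for each NIC.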
On 2 nodes, it depends on which algorithm is used. If we use tree, the performance will be limited by a mix of NVLink and NIC bandwidth, so it does not reflect physical speeds. If we use ring, the busbw will reflect exactly what's going through the NICs.
On 3+ nodes we should use rings for large enough sizes, so the 80 GB/s probably reflects the total NIC bandwidth, hence 20 GB/s per NIC, which is not 24 (what we generally see ourselves on perfect setups) but not too bad either. Also note that you need to use large enough sizes to see the real peak bandwidth. You didn't share the NCCL perf tests output, so I can't tell whether you've reached the peak or not. If you do, please run from 8 bytes to 8 GB (`-b 8 -e 8G -f 2`).
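For reference, the arithmetic behind those numbers can be sketched as follows. This is only an illustration: the count of 4 NICs is inferred from 80 / 20, the function names are hypothetical, and `8G` is assumed to mean 8 * 1024^3 bytes.

```python
def per_nic_bandwidth(total_gbs, num_nics):
    """Split an aggregate ring busbw evenly across the NICs carrying it."""
    return total_gbs / num_nics


def sweep_sizes(start=8, end=8 * 1024**3, factor=2):
    """List the message sizes visited by a sweep like -b 8 -e 8G -f 2."""
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    observed = per_nic_bandwidth(80, 4)  # assumed: 4 NICs per node
    reference = 24                       # GB/s per NIC on a well-tuned setup
    print(f"per NIC: {observed:.1f} GB/s, {observed / reference:.0%} of reference")
    sizes = sweep_sizes()
    print(f"{len(sizes)} sizes, from {sizes[0]} to {sizes[-1]} bytes")
```

The peak usually only shows up in the largest sizes of the sweep, which is why the full range up to 8 GB matters.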