Open arul-lm opened 9 months ago
In a ring-based AllReduce, throughput is limited by the slowest link in the ring, and since you only have NVLinks between pairs of cards, every ring contains at least one hop that falls back to NODE/SYS, e.g.:

0 --NV12--> 1 --NODE--> 2 --NV12--> 3 --SYS--> 4 --> ... etc.
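My mental model of the bottleneck, as a toy sketch (this is my own illustration, not NCCL code; the 200 and 20 GB/s figures are rough placeholders, not measured link speeds):

```python
# Toy model: in a ring AllReduce every rank pushes the same volume through
# its outgoing hop, so achieved bus bandwidth is capped by the slowest hop.

def ring_busbw(hop_bw_gbps):
    """Bus bandwidth (GB/s) of a ring AllReduce; the slowest hop sets the pace."""
    return min(hop_bw_gbps)

def algbw(busbw, n):
    """nccl-tests convention: busbw = algbw * 2*(n-1)/n, so invert that here."""
    return busbw * n / (2 * (n - 1))

# Hops alternating NVLink (~200 GB/s) and NODE/SYS (~20 GB/s), as in the
# topology above (numbers are assumptions for illustration):
hops = [200, 20, 200, 20]
print(ring_busbw(hops))        # capped at the slow hop: 20
print(round(algbw(20, 4), 1))  # corresponding algbw for 4 ranks
```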
This is the GPUDirect communication matrix of the node I'm working with.
When I run the all-reduce test like this, it finishes quickly (as one would expect, since there is a direct NVLink connection): bus_bw is 208 GB/s.
When I run the all-reduce test like the one below, it takes much longer and bus_bw drops to 20 GB/s.
Is there a way to configure this test so that bus_bw stays close to 200 GB/s, i.e. to make all_reduce use the NVLink ring connections to achieve higher bandwidth? (I don't know how NCCL decides which link to use when sending data. For example, sending data directly from GPU0 to GPU2 would be slow, but sending it from GPU0 to GPU1 to GPU2 would be faster.)
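To convince myself, I brute-forced ring orderings over a toy per-pair bandwidth matrix (the matrix and numbers are my own assumptions, modeling 4 GPUs with NVLink only between pairs 0-1 and 2-3):

```python
# Hypothetical illustration: given a per-pair bandwidth matrix, the best
# ring ordering is the one that maximizes the minimum hop bandwidth.
from itertools import permutations

def best_ring(bw):
    """Brute-force the GPU ring ordering whose slowest hop is fastest."""
    n = len(bw)
    best, best_min = None, -1.0
    for perm in permutations(range(1, n)):
        ring = (0,) + perm
        slowest = min(bw[ring[i]][ring[(i + 1) % n]] for i in range(n))
        if slowest > best_min:
            best, best_min = ring, slowest
    return best, best_min

# 4 GPUs where only pairs (0,1) and (2,3) have NVLink (assumed numbers):
bw = [
    [0, 200, 20, 20],
    [200, 0, 20, 20],
    [20, 20, 0, 200],
    [20, 20, 200, 0],
]
print(best_ring(bw))  # every ordering bottoms out at the 20 GB/s links
```

If this toy model is right, then with NVLink only between pairs there is simply no all-NVLink ring available, no matter how the ring is ordered, which would explain what I'm seeing.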
Also, does anybody know of a way to find out whether the node is a DGX, EGX, HGX, or IGX? I'm trying to understand why the NVLinks are wired in this ring setup rather than the all-to-all setup that most search results describe. Sorry, I inherited this system without much documentation, and the node is at a remote location.
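A sketch of one way I thought of to check the chassis identity remotely (assuming a Linux node with sysfs DMI exposed; these files may be missing or empty in VMs or containers):

```python
# Read SMBIOS/DMI identity strings from sysfs, e.g. the system vendor and
# product name, without needing physical access to the machine.
from pathlib import Path

def dmi(field, base="/sys/class/dmi/id"):
    """Return a DMI field such as 'product_name' or 'sys_vendor', or None."""
    p = Path(base) / field
    return p.read_text().strip() if p.exists() else None

print(dmi("sys_vendor"), dmi("product_name"))
```

My understanding (which I'd appreciate someone confirming) is that a DGX should report an NVIDIA vendor/product string here, while HGX baseboards ship inside OEM servers, so the OEM's name would show up instead.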
Thank you!