A100 - All reduce performance

This is the GPUDirect communication matrix of the node I'm working with.

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU1    NV12     X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU2    NODE    NODE     X      NV12    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU3    NODE    NODE    NV12     X      SYS     SYS     SYS     SYS     0-31,64-95      0
GPU4    SYS     SYS     SYS     SYS      X      NV12    NODE    NODE    32-63,96-127    1
GPU5    SYS     SYS     SYS     SYS     NV12     X      NODE    NODE    32-63,96-127    1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV12    32-63,96-127    1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV12     X      32-63,96-127    1
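
To make the matrix easier to read, here is a minimal sketch that lists which GPU pairs are connected by NVLink. The rows are copied from the matrix above with the affinity columns dropped. The helper name `nvlink_pairs` is made up for illustration, and the code assumes any entry starting with `NV` denotes an NVLink connection.

```python
# Rows copied from the matrix above (CPU/NUMA affinity columns dropped).
MATRIX = """\
GPU0 X    NV12 NODE NODE SYS  SYS  SYS  SYS
GPU1 NV12 X    NODE NODE SYS  SYS  SYS  SYS
GPU2 NODE NODE X    NV12 SYS  SYS  SYS  SYS
GPU3 NODE NODE NV12 X    SYS  SYS  SYS  SYS
GPU4 SYS  SYS  SYS  SYS  X    NV12 NODE NODE
GPU5 SYS  SYS  SYS  SYS  NV12 X    NODE NODE
GPU6 SYS  SYS  SYS  SYS  NODE NODE X    NV12
GPU7 SYS  SYS  SYS  SYS  NODE NODE NV12 X
"""


def nvlink_pairs(matrix_text):
    """Return sorted (i, j) GPU index pairs, i < j, whose entry starts with 'NV'.

    Hypothetical helper; assumes row order matches column order (GPU0..GPUn).
    """
    pairs = set()
    for row, line in enumerate(matrix_text.strip().splitlines()):
        cells = line.split()[1:]  # skip the "GPUn" row label
        for col, link in enumerate(cells):
            # The matrix is symmetric, so keep only the upper triangle.
            if link.startswith("NV") and row < col:
                pairs.add((row, col))
    return sorted(pairs)


print(nvlink_pairs(MATRIX))
# -> [(0, 1), (2, 3), (4, 5), (6, 7)]
```

Read this way, the matrix shows NVLink only within the pairs 0-1, 2-3, 4-5, and 6-7. Every other GPU-to-GPU entry is NODE or SYS.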

When I run the all-reduce test like this, it finishes quickly (as one would expect, since there is a direct NVLink connection) and reports a bus_bw of 208 GB/s:

./build/all_reduce_perf -b 8 -e 12G -f 2 -g 1 -t 2 -o avg

When I run the all-reduce test as shown below, it takes a long time and bus_bw drops to 20 GB/s:

./build/all_reduce_perf -b 8 -e 12G -f 2 -g 1 -t 3 -o avg

Is there a way to configure this test so that bus_bw stays close to 200 GB/s, i.e. to make all_reduce use the NVLink ring connections to achieve higher bandwidth? (I don't know how NCCL decides which link to use when sending data. For example, sending data directly from GPU0 to GPU2 would be slow, but sending it from GPU0 to GPU1 to GPU2 would be faster.)
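
One way to reason about the difference between the two runs, as a rough sketch only: assuming `-g 1 -t 2` uses GPU0 and GPU1 and `-g 1 -t 3` uses GPU0, GPU1, and GPU2 (the default device order, which is an assumption here), the code below lists the link type of every hop in a ring over those GPUs. The link types are copied from the matrix above. The helper names are made up for illustration, and this does not model how NCCL actually builds its rings.

```python
# Link types between GPU0, GPU1, and GPU2, copied from the matrix above.
LINKS = {
    frozenset({0, 1}): "NV12",
    frozenset({0, 2}): "NODE",
    frozenset({1, 2}): "NODE",
}


def ring_hops(gpus):
    """Return the link type of each hop in gpus[0] -> gpus[1] -> ... -> gpus[0].

    Hypothetical helper for illustration; not part of NCCL or nccl-tests.
    """
    n = len(gpus)
    if n == 2:
        # With two GPUs, both directions use the same single link.
        return [LINKS[frozenset(gpus)]]
    return [LINKS[frozenset({gpus[i], gpus[(i + 1) % n]})] for i in range(n)]


def all_nvlink(gpus):
    """True if every hop of the ring is an NVLink ('NV...') connection."""
    return all(link.startswith("NV") for link in ring_hops(gpus))


print(ring_hops([0, 1]), all_nvlink([0, 1]))
# -> ['NV12'] True
print(ring_hops([0, 1, 2]), all_nvlink([0, 1, 2]))
# -> ['NV12', 'NODE', 'NODE'] False
```

Under these assumptions, the two-GPU run stays entirely on NVLink, while any ring over GPU0, GPU1, and GPU2 includes hops the matrix reports as NODE. In particular, the matrix lists GPU1 to GPU2 as NODE as well, so the GPU0 to GPU1 to GPU2 relay would still cross a non-NVLink hop.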

Also, does anybody know of a way to find out whether the node is a DGX, EGX, HGX, or IGX? I'm trying to understand why the NVLinks are in this ring setup as opposed to the all-to-all setup that most of the search results seem to indicate. Sorry, I inherited this system without much documentation, and the node is in a remote location.

Thank you!
