NVIDIA / nccl-tests

NCCL Tests

A100 - All reduce performance #178

Open arul-lm opened 9 months ago

arul-lm commented 9 months ago

This is the GPUDirect communication matrix of the node I'm working with.

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU1    NV12     X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU2    NODE    NODE     X      NV12    SYS     SYS     SYS     SYS     0-31,64-95      0
GPU3    NODE    NODE    NV12     X      SYS     SYS     SYS     SYS     0-31,64-95      0
GPU4    SYS     SYS     SYS     SYS      X      NV12    NODE    NODE    32-63,96-127    1
GPU5    SYS     SYS     SYS     SYS     NV12     X      NODE    NODE    32-63,96-127    1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV12    32-63,96-127    1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV12     X      32-63,96-127    1

When I run the all-reduce test on two GPUs (-t 2 with -g 1, i.e. two threads with one GPU each), it finishes quickly, as one would expect with a direct NVLink connection. bus_bw: 208 GBps

./build/all_reduce_perf -b 8 -e 12G -f 2 -g 1 -t 2 -o avg

When I run the all-reduce test on three GPUs, as below, it takes a long time and bus_bw drops to 20 GBps.

./build/all_reduce_perf -b 8 -e 12G -f 2 -g 1 -t 3 -o avg

Is there a way to configure this test to still get bus_bw close to 200 GBps, i.e. to make all_reduce use the NVLink connections to achieve higher bandwidth? (I don't know how NCCL decides which link to use when sending data. For example, sending data directly from GPU0 to GPU2 would be slow, but sending it from GPU0 to GPU1 to GPU2 would be faster.)

Also, does anybody know if there is a way to find out whether the node is a DGX, EGX, HGX or IGX? I'm trying to find out why the NVLinks are in this ring setup as opposed to the all-to-all setup that most of the search results seem to indicate. Sorry, I inherited this system without much documentation, and the node is in a remote location.

Thank you!

david-macleod commented 8 months ago

In a ring-based AllReduce, throughput is limited by the slowest link in the ring. Because you only have NVLinks between pairs of cards, any ring over more than two GPUs will always include a hop that falls back to NODE or SYS, e.g.

   NV12      NODE      NV12      SYS
0 ------> 1 ------> 2 ------> 3 ------> 4 .... etc.
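
The "slowest link" argument can be checked against the matrix posted above. The following is a minimal sketch, not how NCCL itself searches for rings: it assumes the test uses GPUs 0..N-1 in order, and it only ranks the link labels (SYS slower than NODE slower than NV12) without assuming any bandwidth figures. The helper names `ring_links`, `slowest_link` and `best_ring_bottleneck` are made up for illustration.

```python
from itertools import permutations

# Link labels as printed by `nvidia-smi topo -m`, ranked slowest to fastest.
# Only the ordering is used; no bandwidth numbers are assumed.
LINK_RANK = {"SYS": 0, "NODE": 1, "NV12": 2}

# GPU-to-GPU part of the matrix from the issue (row i, column j).
TOPO = """
X    NV12 NODE NODE SYS  SYS  SYS  SYS
NV12 X    NODE NODE SYS  SYS  SYS  SYS
NODE NODE X    NV12 SYS  SYS  SYS  SYS
NODE NODE NV12 X    SYS  SYS  SYS  SYS
SYS  SYS  SYS  SYS  X    NV12 NODE NODE
SYS  SYS  SYS  SYS  NV12 X    NODE NODE
SYS  SYS  SYS  SYS  NODE NODE X    NV12
SYS  SYS  SYS  SYS  NODE NODE NV12 X
"""
MATRIX = [row.split() for row in TOPO.strip().splitlines()]


def ring_links(gpus):
    """Link type of every hop in the ring gpus[0] -> gpus[1] -> ... -> gpus[0]."""
    n = len(gpus)
    return [MATRIX[gpus[i]][gpus[(i + 1) % n]] for i in range(n)]


def slowest_link(gpus):
    """The bottleneck link type of one specific ring ordering."""
    return min(ring_links(gpus), key=LINK_RANK.__getitem__)


def best_ring_bottleneck(gpus):
    """Best bottleneck achievable over all orderings of the given GPUs."""
    first, rest = gpus[0], gpus[1:]
    candidates = (slowest_link([first, *p]) for p in permutations(rest))
    return max(candidates, key=LINK_RANK.__getitem__)


print(slowest_link([0, 1]))             # two GPUs: both hops are NVLink
print(slowest_link([0, 1, 2]))          # three GPUs: a NODE hop appears
print(best_ring_bottleneck([0, 1, 2]))  # no reordering avoids it
print(slowest_link(list(range(8))))     # all eight GPUs: a SYS hop appears
```

With this matrix, the two-GPU ring stays on NV12, while every ordering of three GPUs contains at least one NODE hop, which is consistent with the drop in bus_bw reported above.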