NVIDIA / nccl-tests

Inconsistent all_reduce busbw between 2 nodes #106

Open zhengwy888 opened 2 years ago

zhengwy888 commented 2 years ago

Hello NCCL expert,

I am looking into an NCCL bandwidth issue within the same datacenter, where some machines run at twice the busbw of another set of machines, but I can't tell exactly why that's happening.

Please see the two attached log files: the good pair was able to achieve ~80 GB/s, while the bad pair was only seeing ~40 GB/s. good_speed_nccl.log bad_speed_nccl.log

Here is my command:

/cm/shared/apps/slurm/current/bin/srun --mpi=pmi2 --propagate=NONE --exclusive --ntasks $task -o $HOME/logs/nccltest/${OP}-$CLUSTER_ID-$jobname.log -J $jobname --tasks-per-node=1 --gres=gpu:8 -p $PARTITION  --export=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_LIB:$HOME/repos/nccl/build/lib,NCCL_ALGO=Ring,NCCL_DEBUG=INFO $HOME/repos/nccl-tests/build/${OP}_perf -b 64M -e 2048M -f 2 -g 8
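
In case it helps, a stripped-down two-node version of the same run (the -w node list is just whichever pair is under test) would be:

  srun --mpi=pmi2 -N 2 --ntasks 2 --tasks-per-node=1 --gres=gpu:8 -p $PARTITION -w gn641,gn632 \
      --export=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_LIB:$HOME/repos/nccl/build/lib,NCCL_ALGO=Ring,NCCL_DEBUG=INFO \
      $HOME/repos/nccl-tests/build/all_reduce_perf -b 64M -e 2048M -f 2 -g 8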

and here is the nvidia-smi topo output

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16,21-31,144    1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16,21-31,144    1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     112-127,240-255 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     112-127,240-255 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS
mlx5_1  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS
mlx5_2  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS
mlx5_3  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS
mlx5_5  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS
mlx5_6  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
mlx5_7  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ibstatus showed all 4 links up and running at 200 Gb/s.
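
For reference, each HCA reports roughly the following in ibstatus (output abridged to the relevant fields; exact formatting may differ with driver version):

  $ ibstatus mlx5_0
  Infiniband device 'mlx5_0' port 1 status:
          state:           4: ACTIVE
          phys state:      5: LinkUp
          rate:            200 Gb/sec (4X HDR)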

Can you help to take a look and see how to debug this problem further? Thank you.

sjeaugey commented 2 years ago

Can you run again with NCCL_ALGO=RING? By default, on two nodes, we use the Tree algorithm, which does not directly reflect the GPU-NIC speed, and that makes it harder to "see" issues through the BW numbers.

I can see two reasons for that reduced performance:

  1. Network routing collisions. Some nodes may be lucky and have all their flows in each direction use different links throughout the network, while other pairs may have flows sharing the same link at some point in the routing, causing a bottleneck. That could easily halve the performance. This does not happen on single switches, nor on rail-optimized topologies, but is frequent with classic per-rack network cabling. See slide 24 on network topology (or watch at 21:58) of my GTC talk this year. What is your network topology? Do you have multiple levels of switches? If so, is your network topology rail-optimized? If not, is adaptive routing enabled?
  2. Node misconfiguration, for example ACS being enabled on a node, causing all traffic to be routed through the CPU root complex and thereby halving the performance (a quick check is sketched right after this list). See slide 26 (or watch the video at 24:23).
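
A quick way to check point 2 is the generic lspci ACS query (device addresses and the set of capable devices will differ on your nodes); any '+' flag on an ACSCtl line of a PCIe bridge means ACS is active there:

  $ sudo lspci -vvv | grep -i acsctl
  ACSCtl: SrcValid- TransBlk- ReqRedir- CmplRedir- UpstreamFwd- EgressCtrl- DirectTrans-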

In the log, nodes 641 and 644 are slow, while 641 and 632 are fast. Is any node capable of reaching perfect bandwidth together with node 644? If not, then it's probably a node config issue (point 2 above). If there is a node which can get good performance with 644, it's probably a network fabric issue (point 1 above).

zhengwy888 commented 2 years ago

Thanks for the quick reply. I am already running the test with NCCL_ALGO=Ring.

ACS: here is the output of sudo lspci -vvv, and I don't think ACS is enabled on the PLX devices. gn644_lspci.log

Routing: we use a classic per-rack network, with Adaptive Routing enabled. I did a bit more testing; here is what I see (busbw in GB/s):

gn644,gn699     49.36
gn632,gn641     88.36 
gn284,gn678     49.28  # crossing TOR
gn641,gn644     49.30
gn404,gn699     86.81  # crossing TOR

I initially suspected the slowdown was due to crossing the TOR, but it doesn't look like it. Can you recommend any method for debugging this issue further?

sjeaugey commented 2 years ago

From what you tried, I'd say that node 644 has a problem and is slow. Can you run again with se02gn644,se02gn699 and set NCCL_TOPO_DUMP_FILE=system.txt then post the system.txt here? The topology dump is done by rank 0 by default so make sure rank 0 is on node 644.
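
Concretely, that just means appending the variable to the --export list you already pass to srun, e.g.:

  --export=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_LIB:$HOME/repos/nccl/build/lib,NCCL_ALGO=Ring,NCCL_DEBUG=INFO,NCCL_TOPO_DUMP_FILE=system.txt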

zhengwy888 commented 2 years ago

I dumped the topology file from both the fast test and the slow machines (unfortunately not on the same se02gn644), but their topo files seem to be the same. slow_system.log fast_system.log

How can I debug further? I don't think we can rule out network collisions completely yet; what tools/metrics can I use to confirm that no network collisions are happening?

sjeaugey commented 2 years ago

Ok, indeed the node topology seems okay. It's weird that the PCI numbering is slightly different, but that should not be a cause for halved performance.

For network routing issues, I would usually just use adaptive routing, but from what you mentioned it's supposed to be there already.

I would continue trying other nodes with 644. If you find one node which gets good performance with 644, then I'd look at the network fabric again. If NO node gets good performance, then it's an issue with the node. It could also be that the switch attached to the node has missing links, or some other local issue.
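
If it helps, a small sweep along these lines (node names are only placeholders, reusing your srun flags) makes it easy to test 644 against several partners:

  for peer in gn632 gn641 gn699 gn404; do
      srun --mpi=pmi2 -N 2 --ntasks 2 --tasks-per-node=1 --gres=gpu:8 -p $PARTITION -w gn644,$peer \
          --export=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_LIB:$HOME/repos/nccl/build/lib,NCCL_ALGO=Ring \
          $HOME/repos/nccl-tests/build/all_reduce_perf -b 1G -e 1G -g 8 | grep "Avg bus bandwidth"
  done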

zhengwy888 commented 2 years ago

I grabbed 16 nodes from our farm and plotted the busbw between arbitrary pairs of nodes; the picture makes me believe it's the network fabric. allreduce_bw

Also, according to ibdiagnet, adaptive routing is enabled. If it's indeed the network, how do I debug it?

sjeaugey commented 2 years ago

Is adaptive routing enabled on all InfiniBand Service Levels (SLs)? Adaptive routing has a mask, and if the mask is not all "f" then you may need to set NCCL_IB_SL to select an SL which has adaptive routing enabled.
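
For example (the value 1 below is only an illustration; pick whichever SL has adaptive routing enabled on your fabric), add it to the --export list of your srun command:

  --export=...,NCCL_ALGO=Ring,NCCL_DEBUG=INFO,NCCL_IB_SL=1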

Other than that, I would also check the health of the uplinks, making sure you have all the uplinks on all switches and no high error rates.
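
Assuming the standard infiniband-diags tools are installed, something along these lines gives a first look at link health (exact output varies by fabric):

  $ iblinkinfo        # lists every link with its width/speed, so missing or degraded uplinks stand out
  $ ibqueryerrors     # reports ports whose error counters exceed the default thresholds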

zhengwy888 commented 2 years ago

Thank you so much for the pointer. Our EN_SL_MASK was set to 0xFFFE. Changing NCCL_IB_SL from 0 to 1 generated the following connectivity graph: 20% of pairs were able to reach 70 GB/s or better at SL=1, versus 13% at SL=0, so an improvement. It also seems two machines are more consistent than the others. nccl_bw_anon_sl_1
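
If I read the mask correctly (my assumption), it is a per-SL bitmap, which would explain why SL 0 behaved differently:

  $ printf '0x%X\n' $(( 0xFFFE & (1 << 0) ))   # 0x0 -> bit for SL 0 is clear, so AR was off on the default SL
  $ printf '0x%X\n' $(( 0xFFFE & (1 << 1) ))   # 0x2 -> bit for SL 1 is set, so AR is on for SL 1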

Follow-up questions:

  1. Should I expect the full 80 GB/s out of all IB connections between arbitrary pairs of nodes, or is that unrealistic?
  2. I am curious why the default mask was 0xFFFE. Was it to prevent legacy MPI applications from crashing because of out-of-order packets?
  3. On each machine the first IB HCA is shared between NFS and a GPU, but there wasn't any NFS traffic while nccl-tests was running. Could this configuration affect nccl-tests?
  4. How should I proceed from here?

sjeaugey commented 2 years ago

It's pretty common to have adaptive routing enabled on all SLs except the default one. I've never seen an issue due to AR; it's just a classic opt-in policy.

Now, it doesn't look like AR was enabled in the table above. To me, that still looks very much like static routing, where some (lucky) pairs get full bandwidth, most get 1/2 bandwidth, some get 1/3 and some even 1/4, which is the typical probability distribution. Adaptive routing should give you near-uniform performance across all cases, albeit potentially a little lower than 100% of peak (in your case, I'd typically expect 80-85 GB/s across the board).

As to how to make AR work as expected, I'm not expert enough to guide you on fabric config, so you should reach out to our networking team to confirm how adaptive routing is configured and how to enable it.