zhengwy888 opened 2 years ago
Can you run again with NCCL_ALGO=RING? By default, on two nodes, we use the Tree algorithm, which does not directly reflect the GPU-NIC speed, and that makes it harder to "see" issues through the BW numbers.
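A sketch of such a run, assuming the nccl-tests `all_reduce_perf` binary; hostnames, slot counts, and paths are placeholders, not taken from the thread:

```shell
# Force the Ring algorithm so the reported busbw directly reflects
# GPU-NIC throughput. node1/node2 and the binary path are placeholders.
NCCL_ALGO=RING mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_ALGO \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```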
I can see two reasons for that reduced performance:
1. A network fabric issue, e.g. static routing causing flows from the two nodes to collide on the same links.
2. A node configuration issue, e.g. ACS enabled on the PCIe switches.
In the log, nodes 641 and 644 are slow, while 641 and 632 are fast. Is any node capable of reaching perfect bandwidth together with node 644? If not, then it's probably a node config issue (point 2 above). If there is a node which can get good performance with 644, it's probably a network fabric issue (point 1 above).
Thanks for the quick reply. I am already running the test with NCCL_ALGO=RING.
ACS: here is the output of sudo lspci -vvv, and I don't think ACS is enabled on the PLX devices.
gn644_lspci.log
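For reference, one way to verify the ACS state on the PCI switches, assuming standard pciutils; the device address below is a placeholder, not from the log:

```shell
# Show the ACS control state on all PCI devices; flags like "SrcValid+"
# mean ACS is enabled on that bridge.
sudo lspci -vvv | grep -i acsctl

# If ACS were enabled on a PLX bridge, it could be cleared with setpci,
# e.g. for a hypothetical device at 0000:85:08.0:
# sudo setpci -s 0000:85:08.0 ECAP_ACS+0x6.w=0000
```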
Routing: we use a classic per-rack network, with Adaptive Routing enabled. I did a bit more testing, and here is what I see:
node pair      busbw (GB/s)
gn644,gn699    49.36
gn632,gn641    88.36
gn284,gn678    49.28   # crossing TOR
gn641,gn644    49.30
gn404,gn699    86.81   # crossing TOR
I initially suspected a slowdown due to crossing the TOR, but that doesn't look like the cause. Do you have any methods to recommend for debugging this issue further?
From what you tried, I'd say that node 644 has a problem and is slow. Can you run again with se02gn644,se02gn699 and set NCCL_TOPO_DUMP_FILE=system.txt, then post the system.txt here? The topology dump is written by rank 0 by default, so make sure rank 0 is on node 644.
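For example (a sketch; the slot counts and binary path are placeholders), placing rank 0 on se02gn644 by listing it first:

```shell
# Dump NCCL's detected node topology from rank 0 (on se02gn644, the
# first listed host) into system.txt while running the benchmark.
NCCL_TOPO_DUMP_FILE=system.txt mpirun -np 16 -H se02gn644:8,se02gn699:8 \
    -x NCCL_TOPO_DUMP_FILE \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```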
I dumped the topology file from both a fast pair and a slow pair of machines (unfortunately not on se02gn644 itself), but their topo files seem to be the same: slow_system.log fast_system.log
How can I debug further? I don't think we can rule out network collisions completely yet; do you know what tools or metrics I can use to verify that no network collisions are happening?
Ok, indeed the node topology seems okay. Weird that the PCI numbering is slightly different but that should not be a cause for halved performance.
For network routing issues, I would usually just use adaptive routing, but from what you mentioned it's supposed to be there already.
I would continue trying other nodes with 644. If you find one node which gets good performance with 644, then I'd look at the network fabric again. If no node gets good performance, then it's an issue with the node itself. It could also be the switch attached to the node having missing links, or some other local issue.
I grabbed 16 nodes from our farm and plotted the busbw between arbitrary pairs of nodes. The picture makes me believe it's the network fabric.
Also, according to ibdiagnet, adaptive routing is enabled. If it's indeed the network, how do I debug it?
Is adaptive routing enabled on all Infiniband Service Levels (SL)? Adaptive routing has a mask, and if the mask is not all "f" then you may need to set NCCL_IB_SL to select an SL which has adaptive routing enabled.
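Selecting a different SL is just an environment variable on the same benchmark; a sketch with placeholder hostnames and paths:

```shell
# Run the same benchmark on InfiniBand Service Level 1 instead of the
# default SL 0, assuming adaptive routing is enabled for SL 1 in the
# fabric's AR mask. Everything except NCCL_IB_SL is a placeholder.
NCCL_IB_SL=1 mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_IB_SL \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```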
Other than that, I would also check the health of the uplinks, making sure all the uplinks are present on all switches and that there are no high error rates.
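Assuming the standard infiniband-diags tools are installed, the uplink checks above can be sketched as:

```shell
# Scan the fabric and report ports whose error counters exceed the
# default thresholds (symbol errors, link downed, discards, ...).
sudo ibqueryerrors

# List only links that are in the down state, to spot missing uplinks.
sudo iblinkinfo --down
```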
Thank you so much for the pointer. Our EN_SL_MASK was set to 0xFFFE. Changing NCCL_IB_SL from 0 to 1 produced the following connectivity graph: 20% of pairs were able to reach 70 GB/s or better at SL=1, versus 13% at SL=0, so an improvement. It also seems two machines behave more consistently than the others.
Follow-up questions:
It's pretty common to have adaptive routing enabled on all SLs except the default one. I've never seen an issue due to AR; it's just a classic opt-in policy.
Now, it doesn't look like AR was enabled in the runs above. To me, that still looks very much like static routing, where some (lucky) pairs get full bandwidth, most get 1/2 bandwidth, some get 1/3, and some even 1/4. That is the typical probability distribution for static routing, whereas adaptive routing should give you near-uniform performance across all cases, albeit potentially a little lower than the 100% peak (in your case, I'd typically expect 80-85 GB/s across the board).
As to how to make AR work as expected, I'm not expert enough to guide you on fabric config, so you should reach out to our networking team to confirm how adaptive routing is configured and how to enable it.
Hello NCCL expert,
I am looking into NCCL bandwidth issues in the same datacenter, where some machines are able to run at 2x the busbw of another set of machines, but I can't tell exactly why that's happening.
Please see the 2 attached log files: the good pair was able to achieve ~80 GB/s, while the bad pair was only seeing ~40 GB/s. good_speed_nccl.log bad_speed_nccl.log
Here is my command:
and here is the nvidia-smi topo output
ibstatus showed all 4 links are up and running at 200 Gb/s.
Can you help to take a look and see how to debug this problem further? Thank you.