SHshenhao opened 1 year ago
hey guys, any update here?
I don't have any clue, and I don't have experience with dragonfly topologies either.
NCCL perf tests don't show problems, so it's not even clear there is anything wrong with the network or NCCL. It could be something in the framework causing slowdowns due to CPU or GPU scheduling, throttling (due to overheating), or other.
NCCL perf tests show the bandwidth is below 40 GB/s on 32 nodes, and we want to know how to dig into the root cause. We believe it should be above 45 GB/s, right?
Sorry I wasn't sure which problem you wanted us to help with.
There could be many reasons for performance to decrease at scale. First, you'll need larger and larger buffers to achieve peak bandwidth. Then, your network fabric could be using the same link in the upper levels of the fabric for multiple rings (or trees), causing bottlenecks, in particular if your network is not rail-optimized. Enabling AR should solve that kind of static routing issue, but you didn't share numbers with AR beyond 8 nodes, and 8 nodes is perfect in both cases, so I can't really deduce anything from those numbers.
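To see whether buffer size is the limiter, a sweep across message sizes usually helps. A minimal sketch, assuming nccl-tests is built and launched with one process per GPU via mpirun (the hostfile name and process count are placeholders for your setup):

```shell
# Sweep all_reduce from 8 MB to 8 GB, doubling each step, on 32 nodes x 8 GPUs.
# If busbw only reaches its peak at the largest sizes, small buffers are the limiter;
# if it never reaches the peak at any size, the bottleneck is elsewhere (fabric, routing).
mpirun -np 256 --hostfile hosts.txt \
    ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
```

Compare the busbw column at each size; on a healthy rail-optimized fabric the curve should flatten near the per-node NIC bandwidth once messages reach a few hundred MB.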
Sorry for the misleading tables; the numbers with AR beyond 8 nodes are almost the same as without AR. So it's weird. Do you have any advice on how to solve this?
I'd need a lot more context.
What does "TEST ONE" vs "TEST TWO" mean?
What does "AR enabled" vs "AR Disabled" mean? Did you change something in NCCL, or in the fabric manager (UFM or opensm)?
What is your node topology? How many NICs, speed per NIC, GPU type, NVLink topology.
What is your fabric topology? Is it rail-optimized or not? When running on 8/16/32 is there a point where we jump to an extra network level? You mention a dragonfly topology -- can you explain where the dragonfly connections are?
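If it helps, most of the information above can be gathered with standard tools (a sketch; the exact output depends on your nodes and fabric):

```shell
# Per-node view: GPU/NIC interconnect matrix (NVLink, PCIe switches, NUMA affinity)
nvidia-smi topo -m

# NIC count, link speed, and port state for each InfiniBand HCA
ibstat

# Fabric-wide view: switches, links, and how hosts attach to leaf switches
ibnetdiscover
```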
I'm sorry for the misleading tables and for not describing things clearly, so I have updated my table.

The partition I tested is just a part of the whole cluster. The topology of the entire cluster network is dragonfly, with 10 spine switches and about 20+ leaf switches. Every spine links to each leaf and every leaf links to each spine. 10-20 nodes connect to each leaf switch via one of their two IB cards. The partition I tested has identical nodes: 8x A100, 2x 200G IB cards, and 2x Ethernet cards. In other partitions the situation is different, with a variety of configurations such as V100 nodes and nodes with a single 100G IB card. A small number of nodes in my partition share a few leaf switches with other partitions. But the network engineers insisted there is no problem, because their traffic monitoring on the switches told them that no single interface has ever exceeded 120 Gb/s, and the switches have never raised a traffic alarm.

I ran nccl-tests three times. The first and second runs tested 8, 16, and 32 nodes. After the network engineer checked the switches and set the AR of the switches to active, I did the third run. After the third run I also did a few more NCCL tests when the partition was not as busy. However, I still sometimes measure 28, 39, or 41 GB/s on 16-32+ node tests, especially when the number of nodes is larger.
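As a sanity check, the ~45 GB/s expectation matches a quick back-of-the-envelope calculation, assuming NCCL drives both 200 Gb/s ports on each node and reaches roughly 90% wire efficiency (the 90% figure is an assumption, not a measured value):

```shell
# 2 NICs x 200 Gb/s = 400 Gb/s = 50 GB/s of line rate per node.
# At ~90% efficiency that is ~45 GB/s of bus bandwidth, consistent with
# the ~48 GB/s measured on 8 nodes.
awk 'BEGIN {
    line_rate = 2 * 200 / 8            # GB/s per node
    printf "line rate: %.1f GB/s, ~achievable: %.1f GB/s\n", line_rate, line_rate * 0.9
}'
```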
> The network engineer checked all the switches and set the AR of the switch to active.
I'd need more precise information on what "set the AR of the switch to active" means. That's not usually how it's done; the AR setting usually lives in the fabric manager (e.g. UFM or opensm). It may also be something set on the switch itself in the case of a managed switch, but then I'd need the precise commands he ran and, in general, how the switch is configured with respect to adaptive routing.
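For reference, on a host running opensm the adaptive-routing configuration can usually be read straight from the config file. A sketch, assuming the common config location (the path varies by distribution):

```shell
# Show the routing engine and any adaptive-routing related options opensm uses
grep -E 'routing_engine|ar_' /etc/opensm/opensm.conf

# opensm must be restarted for routing-engine changes to take effect
```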
> The topology of the entire cluster network is dragonfly, with 10 spine switches and about 20+ leaf switches. Every spine links to each leaf and every leaf links to each spine. 10-20 nodes connect to each leaf switch via one of their two IB cards.
That looks like a fat tree to me, not a dragonfly. Can you explain why you think it is a dragonfly? Typically dragonfly topologies do not have spine switches, and connect switches (or blocks of switches) directly to each other.
> Their traffic monitoring on the switches told them that no single interface has ever exceeded 120 Gb/s, and the switches have never raised a traffic alarm.
I'm not sure what you can see from the network monitoring, since the monitoring sampling time is quite large. You'd need to run NCCL perf tests for a very long time, e.g. with `all_reduce_perf -n 2000 -d all -o all -b 256M -e 256M`.
> AdaptiveRouting is enabled on 0 switches.
Our network engineer ran ibdiagnet, and then they checked the config and changed `routing_engine updn` to `routing_engine ar_updn`.
That looks like a fat tree to me, not a dragonfly. Can you explain why you think it is a dragonfly? Typically dragonfly topologies do not have spine switches, and connect switches (or blocks of switches) directly to each other.
Sorry for that. It is more like a fat tree, not a dragonfly.
> I'm not sure what you can see from the network monitoring. You'd need to run NCCL perf tests for a very long time, e.g. with `all_reduce_perf -n 2000 -d all -o all -b 256M -e 256M`, since the monitoring sampling time is quite large.
You are right.
Thank you so much for the suggestion; I will try it. I have wanted to do a long-running test, but it hasn't been easy to get the chance until now because the cluster is so busy.
So I checked the network monitoring info from the network engineers, which is nearly the only data they can offer. The switch monitoring system is very unsatisfactory: the minimum sampling interval is 5 minutes, and data queries, especially historical ones, are very hard to use.
> they checked the config and changed `routing_engine updn` to `routing_engine ar_updn`
Did they set that in the opensm config and then restart opensm? Or are you using some other tool? Can they also check what the AR SL mask is? Something like:

```
ar_sl_mask 0xFFFF
```
If it is not set to `0xFFFF`, you may need to set `NCCL_IB_SL` to a service level where AR is enabled.
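If AR turns out to be enabled only on some SLs, steering NCCL onto one of them is just an environment variable. A sketch, assuming SL 1 is an AR-enabled service level on your fabric (the SL number and launch command are placeholders):

```shell
# Make NCCL send its InfiniBand traffic on service level 1
export NCCL_IB_SL=1
mpirun -np 256 --hostfile hosts.txt \
    ./build/all_reduce_perf -b 256M -e 256M -n 2000
```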
Yes, they restarted opensm, and then they checked that AR is enabled on all (24) switches.
Can you get the information about the `ar_sl_mask` in the opensm configuration? Being enabled is good -- but it doesn't mean it's enabled on all Service Levels (SLs).
After those tests, the network engineer checked all the switches and set the AR of the switches to active, and I did the third test.
| nodes | busbw (GB/s) |
| ----------- | ----------- |
| 8 nodes | 48.105 |

However, I'm still having trouble figuring out what might be causing the issue. Could you please help me troubleshoot this further?