NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Training speed anomalies in multi-node task on Networking Dragonfly Topology #883

Open SHshenhao opened 1 year ago

SHshenhao commented 1 year ago
Hi there, I'm running a multi-node training task on a SLURM cluster with a dragonfly network topology. Some of the nodes have dual InfiniBand while others have a single InfiniBand card, and my nodes are allocated across multiple switches. I noticed some strange behavior in the training speed: sometimes the per-iteration time increases by about 0.01 s every 300 iterations (starting from roughly 5 s/iter), and sometimes the speed becomes very slow for about an hour and then recovers to around 80% of the previous speed. This happens whether I use only dual-InfiniBand nodes or a mix of both. I ran some basic InfiniBand (IB) and NCCL tests, but aside from some minor deviations in bandwidth, I didn't observe anything unusual. I ran the NCCL test twice on the same set of dual-InfiniBand nodes; at that point the AR of the switches was not active.

NCCL test bandwidth (GB/s):

| FIRST TEST: 8 nodes | 16 nodes | 32 nodes | SECOND TEST: 8 nodes | 16 nodes | 32 nodes |
| --- | --- | --- | --- | --- | --- |
| 48.3 | 36.845 | 39.595 | 47.435 | 42.73 | 31.39 |
| 46.785 | 45.735 | 48.495 | 44.595 | 31.39 | |
| 48.2 | | | 48.17 | | |
| 45.82 | | | 48.46 | | |

After those tests, the network engineer checked all the switches and set the AR of the switches to active, and I did a third test.

THIRD TEST:

| 8 nodes |
| ----------- |
| 48.105 |

However, I'm still having trouble figuring out what might be causing the issue. Could you please help me troubleshoot this further?
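
For completeness, this is roughly how the bandwidth numbers above were measured (a sketch rather than the exact command line; it assumes the nccl-tests binaries built with MPI support and a standard SLURM launch):

```shell
# One task per node, 8 GPUs per task, 32 nodes; report all-reduce bus bandwidth
# at a single large message size.
srun -N 32 --ntasks-per-node=1 \
    ./build/all_reduce_perf -b 1G -e 1G -g 8 -n 100
```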

Jack47 commented 1 year ago

hey guys, any update here?

sjeaugey commented 1 year ago

I don't have any clue, and admittedly I don't have experience with dragonfly topologies.

NCCL perf tests don't show problems, so it's not even clear there is anything wrong with the network or NCCL. It could be something in the framework causing slowdowns due to CPU or GPU scheduling, throttling (due to overheating), or something else.
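
One quick way to rule out GPU throttling while a slow phase is happening (a sketch; it only assumes nvidia-smi is available on the compute nodes):

```shell
# Show current clocks and any throttle reasons (thermal, power cap, ...).
nvidia-smi -q -d PERFORMANCE,CLOCK

# Continuously sample power, temperature, utilization and clocks during training.
nvidia-smi dmon -s puc
```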

Jack47 commented 1 year ago

The NCCL perf tests show bandwidth below 40 GB/s on 32 nodes, and we want to know how to dig into the root cause. We believe it should be above 45 GB/s, shouldn't it?

sjeaugey commented 1 year ago

Sorry I wasn't sure which problem you wanted us to help with.

There could be many reasons for performance to decrease at scale. First, you'll need larger and larger buffers to achieve peak bandwidth. Then, your network fabric could be using the same link in upper levels of the fabric for multiple rings (or trees), causing bottlenecks, in particular if your network is not rail-optimized. Enabling AR should solve that kind of static-routing issue, but you didn't share numbers with AR beyond 8 nodes, and 8 nodes is near peak in both cases, so I can't really deduce anything from those numbers.
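
As a concrete way to check the first point, you can sweep message sizes and see where bandwidth saturates at each node count (a sketch using standard nccl-tests options):

```shell
# Sweep message sizes from 8 bytes to 8 GB, doubling each step.
# NCCL_DEBUG=INFO also logs which rings/trees NCCL builds over the fabric.
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 -n 50
```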

Jack47 commented 1 year ago

Sorry for the misleading tables. The numbers with AR beyond 8 nodes are almost the same as without AR, so it's weird. Do you have any advice on how to solve this?

sjeaugey commented 1 year ago

I'd need a lot more context.

What does "TEST ONE" vs "TEST TWO" mean?

What does "AR enabled" vs "AR Disabled" mean? Did you change something in NCCL, or in the fabric manager (UFM or opensm)?

What is your node topology? How many NICs, speed per NIC, GPU type, NVLink topology.

What is your fabric topology? Is it rail-optimized or not? When running on 8/16/32 is there a point where we jump to an extra network level? You mention a dragonfly topology -- can you explain where the dragonfly connections are?
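
For reference, most of that node and fabric information can be gathered with standard tools (assuming the NVIDIA driver and the InfiniBand diagnostic utilities are installed on the nodes):

```shell
# GPU/NIC/NVLink topology inside one node (PIX, NVx, SYS, ...).
nvidia-smi topo -m

# Per-port InfiniBand link rate and state on the node.
ibstat

# Dump the fabric topology (switches and links) as seen from this host.
ibnetdiscover
```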

SHshenhao commented 1 year ago

I'm sorry for the misleading tables and for not describing things clearly; I have updated my tables above. The partition I tested is just a part of the whole cluster. The topology of the entire cluster network is a dragonfly with 10 spine switches and about 20+ leaf switches. Every spine links to each leaf and every leaf links to each spine. 10-20 nodes are connected to each leaf switch through one of their two IB cards. The partition I tested has identical nodes: 8x A100 GPUs, 2x 200G IB cards and 2x Ethernet cards per node. However, in other partitions the situation is different; there is a variety of configurations, with nodes such as V100 and single 100G IB. A small number of nodes in my partition share a few leaf switches with other partitions. But the network engineers insisted that there is no problem, because their traffic monitoring on the switches told them that no single interface has ever exceeded 120 Gb/s, and the switches have never raised a traffic alarm.

I ran the NCCL test three times. 8, 16 and 32 nodes were tested the first and second time. After the network engineer checked the switches and set the AR of the switches to active, I ran the third test. After the third test, I also ran a few NCCL tests when the partition was not that busy. However, bandwidths of 28, 39 or 41 GB/s can still be measured on some 16-32+ node runs, especially when the number of nodes is larger.

sjeaugey commented 1 year ago

> The network engineer checked all the switches and set the AR of the switches to active.

I'd need more precise information on what "set the AR of the switch to active" means. That's not how it's usually done: the AR setting normally lives in the fabric manager (e.g. UFM or opensm). It may also be something that can be set on the switch itself in the case of a managed switch, but then I'd need the precise commands he ran and, in general, how the switch is configured w.r.t. adaptive routing.

> The topology of the entire cluster network is a dragonfly with 10 spine switches and about 20+ leaf switches. Every spine links to each leaf and every leaf links to each spine. 10-20 nodes are connected to each leaf switch through one of their two IB cards.

That looks like a fat tree to me, not a dragonfly. Can you explain why you think it is a dragonfly? Typically dragonfly topologies do not have spine switches, and connect switches (or blocks of switches) directly to each other.

> their traffic monitoring on the switches told them that no single interface has ever exceeded 120 Gb/s, and the switches have never raised a traffic alarm.

I'm not sure what you can see from the network monitoring. You'd need to run NCCL perf tests for a very long time, e.g. with `all_reduce_perf -n 2000 -d all -o all -b 256M -e 256M`, since the monitoring sampling time is quite large.
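
A sketch of how such a long run could be lined up with the switch counters afterwards (the loop and log file are only illustrative; the all_reduce_perf flags are the ones above):

```shell
# Repeat the 256 MB all-reduce test for an extended period, timestamping each run
# so the results can be correlated with the switches' coarse-grained samples.
for i in $(seq 1 48); do
    date >> nccl_longrun.log
    ./build/all_reduce_perf -n 2000 -d all -o all -b 256M -e 256M >> nccl_longrun.log 2>&1
done
```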

SHshenhao commented 1 year ago

> AdaptiveRouting is enabled on 0 switches.

Our network engineer ran ibdiagnet. They then checked the configuration and changed `routing_engine updn` to `routing_engine ar_updn`.
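
Concretely, the change was along these lines (a sketch; the config path /etc/opensm/opensm.conf is an assumption about a typical opensm installation, not necessarily what our engineers used):

```shell
# Verify which routing engine the subnet manager is configured to use:
# it should now read "routing_engine ar_updn" instead of "routing_engine updn".
# Note: a routing engine change only takes effect after opensm is restarted.
grep routing_engine /etc/opensm/opensm.conf
```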

> That looks like a fat tree to me, not a dragonfly. Can you explain why you think it is a dragonfly? Typically dragonfly topologies do not have spine switches, and connect switches (or blocks of switches) directly to each other.

Sorry for that. It is more like a fat tree, not a dragonfly.

> I'm not sure what you can see from the network monitoring. You'd need to run NCCL perf tests for a very long time, e.g. with `all_reduce_perf -n 2000 -d all -o all -b 256M -e 256M`, since the monitoring sampling time is quite large.

You are right. Thank you so much for your suggestion; I will try it. I did try to run a long test, but until now it has not been easy to get the chance to do so because the cluster is so busy.
So I checked the network monitoring info from the network engineers, which is nearly the only data they can offer. The switch monitoring system is quite unsatisfactory, though: the minimum sampling interval is 5 minutes, and querying data, especially historical data, is very cumbersome.

sjeaugey commented 1 year ago

> they checked the configuration and changed `routing_engine updn` to `routing_engine ar_updn`

Did they set that in the opensm config and then restarted opensm? Or are you using some other tool? Can they also check what the AR SL mask is? Something like:

```
ar_sl_mask 0xFFFF
```

If not set to 0xFFFF, you may need to set NCCL_IB_SL to a service level where AR is enabled.
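
For instance, if only some SLs have AR enabled, the perf test could be pinned to one of them (a sketch; the value 1 below is only an illustration, use whichever SL the mask actually enables):

```shell
# Force NCCL's InfiniBand traffic onto service level 1 for this run.
NCCL_IB_SL=1 ./build/all_reduce_perf -b 1G -e 1G -g 8 -n 100
```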

Jack47 commented 1 year ago

Yes, they restarted opensm, and then they checked that AR is enabled on all (24) switches.

[screenshot attached]

sjeaugey commented 1 year ago

Can you get the information about the ar_sl_mask in the opensm configuration? Being enabled is good -- but it doesn't mean it's enabled on all Service Levels (SLs).
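
A sketch of how to retrieve it (assuming opensm with its configuration at /etc/opensm/opensm.conf; the path differs per distribution, and with UFM the equivalent setting lives in the UFM/SM configuration):

```shell
# Show the adaptive-routing related options the subnet manager is configured with.
grep -iE 'routing_engine|ar_sl_mask' /etc/opensm/opensm.conf
```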