NVIDIA / nccl-tests


Two A800 nodes cannot reach ideal all-reduce performance #156

Open joydchh opened 1 year ago


joydchh commented 1 year ago

Hello,

We are trying to set up the VM environment for A800, the configuration is as follows:

  • Virtualization

    • qemu 2.10
    • GPU: passthrough
    • NIC: SRIOV (VF)
  • VM size: A800 × 8, RoCE 200Gb × 4 (for compute), RoCE 100Gb × 1 (for management)

We have mapped the topology in the VM to closely match the host: GPU0/1 and mlx5_1 sit under the same PCIe switch, GPU2/3 and mlx5_2 under the same PCIe switch, and so on. (Topology screenshots attached.)

We tested three scenarios:

  1. ib_write_bw for all pairs of NICs between the two nodes (VMs)
  2. NCCL all-reduce within a single VM
  3. NCCL all-reduce between the two nodes (VMs)

Scenarios 1 and 2 reach almost the same performance as bare metal (no QEMU virtualization involved), but the third case drops by almost 25%, achieving only 76 GB/s bus bandwidth versus 96 GB/s on bare metal.

We don't have a clue how to dig deeper into this result. Any ideas for debugging?

Attaching an NCCL trace log in case it's helpful: nccl-test.log

sjeaugey commented 1 year ago

For 1., did you run the test bidirectionally? You would need traffic in both directions to mimic NCCL usage.

Regarding inter-node communication, to get peak performance within a VM, you need to enable ATS so that NIC-GPU traffic doesn't go to the CPU root complex. Was that done?
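
For reference, one way to confirm that ATS is actually enabled on the NIC's PCIe function is to inspect its capability bits with lspci; the PCI address below is a placeholder:

# Placeholder PCI address; use the NIC's VF/PF address as seen inside the VM
sudo lspci -vvv -s 0000:3b:00.0 | grep -i -A 2 "Address Translation Service"
# ATS is active when the output shows an "ATSCtl:" line with "Enable+"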

joydchh commented 1 year ago

Yes. Besides, we made big progress today: by setting NCCL_NET_GDR_READ to 0, NCCL all-reduce can reach 96 GB/s, the same peak as the host. But this confuses us; we thought NCCL_NET_GDR_READ=1 (enabled) would give better performance. Do you know why?
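
For context, a minimal sketch of the kind of two-node run being compared here, assuming the MPI build of nccl-tests; hostnames, process counts and message sizes are placeholders:

# 2 VMs x 8 GPUs; toggle NCCL_NET_GDR_READ between runs (vm1/vm2 are placeholder hostnames)
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_NET_GDR_READ=0 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1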

sjeaugey commented 1 year ago

Interesting. Disabling GDR for reads means there will be traffic (only in one direction) between the GPU and the CPU; it will use more PCI bandwidth, and put more pressure on CPU memory. But that means you have a performance issue with GDR reads.

Did you run the ib_write_bw tests on CPU memory or GPU memory (i.e. did you use the CUDA-enabled IB perftests)? You should be able to see read issues if you use GPU memory. It could be a matter of tuning the right number of read buffers, as discussed here: https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/2791440385/GPUDirect+Benchmarking#Adapter-Firmware-Configuration
200G NICs usually need a value of 44 to perform at full speed.

pokitpeng commented 1 year ago

Yeah, before the test we did this:

mlxconfig -d <pci_addr> --yes set ADVANCED_PCI_SETTINGS=1
mlxconfig -d <pci_addr> --yes set MAX_ACC_OUT_READ=44

As for the ib_write_bw test, we have done it, and it reaches the best performance once the number of QPs is configured appropriately. ib_write_bw.log ib_write_bw_with_cuda_faster.log ib_write_bw_with_cuda_slower.log
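
For reference, a sketch of the general form of such a perftest run (device name, GPU index, QP count and the server address are illustrative; --use_cuda requires a CUDA-enabled perftest build):

# Server VM: receiver using GPU memory as the buffer, with multiple QPs
ib_write_bw -d mlx5_1 -q 4 --report_gbits --use_cuda=0
# Client VM: same options, pointing at the server (placeholder address)
ib_write_bw -d mlx5_1 -q 4 --report_gbits --use_cuda=0 <server_ip>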

sjeaugey commented 1 year ago

Can you try running NCCL tests with NCCL_IB_QPS_PER_CONNECTION=2 NCCL_IB_SPLIT_DATA_ON_QPS=0?

joydchh commented 1 year ago

> Can you try running NCCL tests with NCCL_IB_QPS_PER_CONNECTION=2 NCCL_IB_SPLIT_DATA_ON_QPS=0?

We found that NCCL_NET_GDR_READ=1 with NCCL_NET_GDR_LEVEL=3 can also reach full speed between the two VMs. Those settings seem much more reasonable.

But we hit a new problem when adding a third VM to the test group. Any pair of the three VMs can reach full speed in the NCCL all_reduce test, but bandwidth drops to 65GB/s when running all three together. The logs are attached; do you have any insights? (We're stuck :<)

3-nodes-nccl-test.log

2-nodes-nccl-test.log 3-nodes-export-topo.log

sjeaugey commented 1 year ago

NCCL_NET_GDR_READ=1 and NCCL_NET_GDR_LEVEL=3 will incur additional pressure on the CPU memory. That can have a performance impact on real applications. Hence my question about other settings which would not have such negative impact.

I'm realizing the experiments on 2 VMs did not set NCCL_ALGO=RING. So you were not measuring the real network bandwidth if NCCL was using the Tree algorithm; on 2 nodes, the amount of traffic on the network is not the reported BusBw. Maybe the problem is simply that ATS is not enabled, hence all traffic goes back to the CPU and your bandwidth is halved.
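
To measure the actual inter-node bandwidth on 2 nodes, the run can be repeated with the ring algorithm forced; a sketch, with the same placeholder hostnames as above:

mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_ALGO=RING \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1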

joydchh commented 1 year ago

> NCCL_NET_GDR_READ=1 and NCCL_NET_GDR_LEVEL=3 will incur additional pressure on the CPU memory. That can have a performance impact on real applications. Hence my question about other settings which would not have such negative impact.
>
> I'm realizing the experiments on 2 VMs did not set NCCL_ALGO=RING. So you were not measuring the real network bandwidth if NCCL was using the Tree algorithm; on 2 nodes, the amount of traffic on the network is not the reported BusBw. Maybe the problem is simply that ATS is not enabled, hence all traffic goes back to the CPU and your bandwidth is halved.

I just checked: ATS is enabled on the mlx NICs.

(screenshot attached)

And yes, on 2 VMs we set NCCL_ALGO=Tree; RING had a much lower BusBw. Does that mean we actually didn't reach the ideal network bandwidth on 2 VMs?

joydchh commented 1 year ago

BTW, you mentioned that Ring should reflect the real network bandwidth when there are just two VMs. We went back to check the topology injected into the VM and adjusted it to the hierarchy below:

(screenshot of the adjusted topology attached)

virtualTopology.log

Now, Ring and Tree can both reach 65GB/s in the three-VM test. That seems much more reasonable than the previous result, where Ring only got 20GB/s. But 65GB/s is still not the ideal result. Could it be a topology problem (the topology is also attached), given that the VM cannot emulate the PCIe switches exactly as on the host?

sjeaugey commented 1 year ago

It could be due to many reasons. One of them is that ATS (despite being enabled on the NIC) may not actually be in use, as many things need to align for it to take effect.

When running ib_write_bw, did you run 8 instances in parallel so that all 8 NICs run bidirectionally, at full speed, reading and writing to GPU memory (i.e. using the CUDA-enabled version of the IB perftests)? Running it that way makes it equivalent to what NCCL does and helps figure out whether it's an NCCL issue or an IB virtualization issue.
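
As a rough sketch of that procedure for the 4 compute NICs in this setup (device names, GPU indices, ports and the server address are illustrative; -b makes each pair run bidirectionally):

# Server VM: one listener per NIC, each using a GPU under the same PCIe switch (adjust indices to your topology)
for i in 1 2 3 4; do
  ib_write_bw -d mlx5_$i -b -q 4 --report_gbits --use_cuda=$((2*(i-1))) -p $((18514+i)) &
done
wait
# Client VM: one sender per NIC towards the matching listener
for i in 1 2 3 4; do
  ib_write_bw -d mlx5_$i -b -q 4 --report_gbits --use_cuda=$((2*(i-1))) -p $((18514+i)) <server_ip> &
done
wait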

joydchh commented 1 year ago
WeChatWorkScreenshot_1f51d71a-717a-4896-acd1-961a6f0d576c

This is the result of running 4 instances in parallel so that all 4 NICs run bidirectionally; it's almost the same speed as running directly on the host. Does this mean the problem is in NCCL, maybe a wrong topology or wrong parameters?

Also attaching the topology we built in the VM. Host:

WeChatWorkScreenshot_bd21247e-4349-4fe9-a804-049d71a14bda

host.xml.log

VM:

WeChatWorkScreenshot_6cc27557-a2d2-4d2f-a868-e3853f870430

vm_v2.xml.log
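
For what it's worth, the injected topology can also be cross-checked on the NCCL side, by dumping what NCCL detects and by feeding the hand-built XML back in explicitly; a sketch with placeholder hostnames and paths:

# Dump the topology NCCL detects inside the VMs, to diff against the host XML
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_TOPO_DUMP_FILE=/tmp/nccl_detected_topo.xml \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
# Re-run with the hand-built topology injected explicitly
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_TOPO_FILE=/path/to/vm_v2.xml \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1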

sjeaugey commented 1 year ago

Can you share the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV? That would help me see if the topology injection didn't work as expected.

pokitpeng commented 1 year ago

> Can you share the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV? That would help me see if the topology injection didn't work as expected.

Here are the logs from runs on two virtual machines, using the two algorithms: nccl-vm-use-ring-algo.log nccl-vm-use-tree-algo.log

joydchh commented 1 year ago

@sjeaugey Did you find anything wrong in the NCCL output? We also captured a lot of pause frames on the NIC side when running all-reduce, but everything works well in the GDR test (perftest with --use_cuda).

sjeaugey commented 1 year ago

Sorry for the delay. Everything seems normal in the log: we find the 4 NICs and the 4 paths running at 24GB/s.

I can only imagine that the ACS/ATS system is causing a bottleneck, or that the additional PCI latency is causing network issues. Not sure why you didn't see that with the low-level IB perftests, though. I'd encourage you to contact networking support; they might be better able to help you ensure everything is set up correctly.

apoorvemohan commented 10 months ago

@joydchh Were you able to resolve the issue? If yes, could you please share the root cause and solution?