joydchh opened 1 year ago
Hello,
We are trying to set up the VM environment for A800, the configuration is as follows:
Virtualization
- qemu 2.10
- GPU: passthrough
- NIC: SRIOV (VF)
- VM size: A800 × 8, RoCE 200Gb × 4 (for computing), RoCE 100Gb × 1 (for management)
We have mapped the topology in the VM to closely match the host: GPU0/1 and mlx5_1 under the same PCIe switch, GPU2/3 and mlx5_2 under the same PCIe switch, and so on.
We tested three scenarios:
- ib_write_bw for all pairs of NICs between two nodes (VMs)
- NCCL all-reduce within a single VM
- NCCL all-reduce between two nodes (VMs)
Scenarios 1 & 2 reach almost the same performance as bare metal (no QEMU virtualization involved), but the third case dropped by almost 25%, to only 76 GB/s bus bandwidth, versus 96 GB/s on bare metal.
We don't have a clue how to dig deeper into this result. Any ideas for debugging?
Attaching an NCCL trace log; hope it's helpful: nccl-test.log
For 1, did you run the test bidirectionally? You would need traffic in both directions to mimic NCCL usage.
Regarding inter-node communication: to get peak performance within a VM, you need to enable ATS so that NIC-GPU traffic doesn't go through the CPU root complex. Was that done?
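For reference, a minimal sketch of how both points might be checked (the PCI address, device name, and peer address are placeholders; whether mlxconfig exposes ATS_ENABLED depends on the adapter and firmware):
# Check whether the ATS capability is present and enabled on the VF
# (PCI address is a placeholder).
lspci -vvv -s 0000:3b:00.0 | grep -i -A2 "Address Translation Service"
# Query the firmware-level ATS setting, if the adapter exposes it.
mlxconfig -d 0000:3b:00.0 query ATS_ENABLED
# Bidirectional bandwidth test (-b) to mimic NCCL's two-way traffic.
# Server side (VM A):
ib_write_bw -d mlx5_1 -b --report_gbits
# Client side (VM B), pointing at VM A:
ib_write_bw -d mlx5_1 -b --report_gbits <vm_a_ip>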
Yes. Besides, we made big progress today: by setting NCCL_NET_GDR_READ to 0, NCCL all-reduce can reach 96 GB/s, the same peak as on the host. But we're confused by this; we thought NCCL_NET_GDR_READ=1 (enabled) would give better performance. Do you know why?
Interesting. Disabling GDR for reads means there will be traffic (in one direction only) between the GPU and the CPU; it will use more PCI bandwidth and put more pressure on CPU memory. But it also means you have a performance issue with GDR reads.
Did you run the ib_write_bw tests on CPU memory or GPU memory (i.e. did you use the CUDA-enabled IB perftests)? You should be able to see issues with reads if you use GPU memory. It could be a matter of tuning the right number of read buffers, as discussed here: https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/2791440385/GPUDirect+Benchmarking#Adapter-Firmware-Configuration. 200G NICs usually need a value of 44 to perform at full speed.
Yeah, before the test, we did this:
mlxconfig -d <pci_addr> --yes set ADVANCED_PCI_SETTINGS=1
mlxconfig -d <pci_addr> --yes set MAX_ACC_OUT_READ=44
As for the ib_write_bw test, we have run it, and it reaches the best performance once the number of QPs is configured appropriately. ib_write_bw.log ib_write_bw_with_cuda_faster.log ib_write_bw_with_cuda_slower.log
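Side note, not from the thread: mlxconfig changes normally only take effect after a firmware reset or reboot, so it may be worth confirming the value is actually active. A hedged sketch, with <pci_addr> as a placeholder:
# Confirm the settings are applied, not just staged.
mlxconfig -d <pci_addr> query ADVANCED_PCI_SETTINGS MAX_ACC_OUT_READ
# Apply pending firmware configuration without a full host reboot,
# if mlxfwreset supports this adapter.
mlxfwreset -d <pci_addr> reset -y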
Can you try running NCCL tests with NCCL_IB_QPS_PER_CONNECTION=2 NCCL_IB_SPLIT_DATA_ON_QPS=0?
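A minimal sketch of how these variables might be applied to an nccl-tests run (hostnames, rank counts, and the all_reduce_perf path are placeholders for this setup; an mpirun launcher is assumed):
# Hypothetical 2-VM, 8-GPU-per-VM all_reduce_perf launch with the
# suggested QP settings exported to every rank.
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_IB_QPS_PER_CONNECTION=2 \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1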
We found that NCCL_NET_GDR_READ=1 together with NCCL_NET_GDR_LEVEL=3 can also reach full speed between two VMs. That seems like a much more reasonable set of settings.
But we ran into new problems when adding a third VM to the test group. Any two of the three VMs can reach full speed in the NCCL all_reduce test, but it drops to 65 GB/s when running all of them together. The logs are attached; do you have any insights about this? (We're stuck :<)
NCCL_NET_GDR_READ=1 and NCCL_NET_GDR_LEVEL=3 will incur additional pressure on the CPU memory. That can have a performance impact on real applications. Hence my question about other settings which would not have such negative impact.
I'm realizing the experiments on 2 VMs did not set NCCL_ALGO=RING. So you were not measuring the real network bandwidth if NCCL was using the Tree algorithm; on 2 nodes, the amount of traffic on the network is not the reported BusBw. Maybe the problem is simply that ATS is not enabled, hence all traffic goes back to the CPU and your bandwidth is halved.
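To make the 2-VM number comparable, the same placeholder launch could force the Ring algorithm so the reported BusBw corresponds to actual network traffic (a sketch under the same assumptions as above):
# Force the Ring algorithm for the 2-VM measurement.
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_ALGO=Ring \
    -x NCCL_NET_GDR_READ=1 -x NCCL_NET_GDR_LEVEL=3 \
    ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1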
I just checked: ATS is enabled on the mlx cards.
And yes, on 2 VMs we set NCCL_ALGO=Tree; Ring had a much lower BusBw. Does that mean we actually didn't reach the ideal network bandwidth on 2 VMs?
BTW, you mentioned Ring should reflect the network bandwidth when there are just two VMs. We went back to check the topology injected into the VM and adjusted it to the hierarchy below:
Now, Ring and Tree can both reach 65 GB/s in the three-VM test. That seems much more reasonable than the previous result, where Ring only got 20 GB/s. But 65 GB/s still isn't the ideal result. Could it be a topology problem (the topo is also attached), since the VM cannot emulate the PCIe switches exactly as on the host?
It could be due to many reasons. One of them is that ATS, despite being enabled on the NIC, might not actually be used, as many things need to align for it to be used.
When running ib_write_bw, did you run 8 instances in parallel so that all 8 NICs run bidirectionally at full speed, reading and writing GPU memory (i.e. using the CUDA-enabled version of the IB perftests)? Running it that way makes it equivalent to what NCCL does and helps figure out whether it's an NCCL issue or an IB virtualization issue.
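A hedged sketch of such a run for the 4-compute-NIC VMs described here (device names, GPU indices, ports, duration, and the server address are placeholders; --use_cuda assumes the CUDA-enabled perftest build, and the GPU/NIC pairing follows the PCIe-switch mapping given above):
# Server VM: one ib_write_bw per NIC, bidirectional, on GPU memory,
# pairing mlx5_1 with GPU0, mlx5_2 with GPU2, etc.
for i in 0 1 2 3; do
    ib_write_bw -d mlx5_$((i+1)) -b --use_cuda=$((2*i)) \
        -p $((18515+i)) --report_gbits -D 30 &
done; wait
# Client VM: mirror the same mapping, targeting the server.
for i in 0 1 2 3; do
    ib_write_bw -d mlx5_$((i+1)) -b --use_cuda=$((2*i)) \
        -p $((18515+i)) --report_gbits -D 30 <server_ip> &
done; wait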
This is the result of running 4 instances in parallel so that all 4 NICs run bidirectionally; it's almost the same speed as running directly on the host. Does this mean the problem is in NCCL, maybe a wrong topology or wrong parameters?
Also attaching the topology we built in the VM.
Host:
VM:
Can you share the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV? That would help me see if the topology injection didn't work as expected.
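For completeness, a hypothetical way to capture that log with the same placeholder launch, writing one file per host/process via NCCL_DEBUG_FILE:
# %h and %p in NCCL_DEBUG_FILE expand to hostname and pid.
mpirun -np 16 -H vm1:8,vm2:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV \
    -x NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p.log \
    ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1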
Here are the logs from runs on the two virtual machines, using the two algorithms: nccl-vm-use-ring-algo.log nccl-vm-use-tree-algo.log
@sjeaugey Did you find anything wrong in the NCCL output? We also captured a lot of pause frames on the NIC side when running all-reduce, but everything works well in the GDR test (perftest with --use_cuda).
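A hedged aside on the pause frames: the per-priority counters can be watched on the RoCE netdevs while the test runs (the interface name is a placeholder and counter names vary by driver version):
# Pause / PFC counters on a mlx5 netdev.
ethtool -S eth2 | grep -i pause
# Current PFC / trust configuration for the port, if mlnx_qos is installed.
mlnx_qos -i eth2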
Sorry for the delay. Everything seems normal in the log: we find the 4 NICs and the 4 paths running at 24GB/s.
I can only imagine that the ACS/ATS system is causing a bottleneck, or that the additional PCI latency is causing network issues. Not sure why you didn't see that with the low-level IB perftests, though. I'd encourage you to contact the networking support; they might be better able to help you ensure everything is set up correctly.
@joydchh Were you able to resolve the issue? If yes, could you please share the root cause and solution.