Open apoorvemohan opened 5 months ago
Re. topo1 vs topo2: 200Gbps matches the bandwidth of PCIe Gen4 x16, so NCCL will only use one NIC in that situation. Reducing the advertised bandwidth to 100Gbps makes NCCL use both interfaces instead of just one. With two ports, NCCL also needs to use 2x the number of SMs, which may bring extra performance in benchmarks, but may also significantly degrade application performance, since more SMs are used for NCCL and not for compute.
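For context, the per-port bandwidth NCCL sees comes from the `speed` attribute of the `<net>` entry in the injected topology file. A minimal sketch of what that change amounts to (busid, name and guid values are placeholders, other attributes elided; NCCL's own dumps express the speed in Mbps):

```xml
<!-- Hypothetical excerpt of an injected NCCL topology file.
     speed="200000" (200G): one port already saturates PCIe Gen4 x16, so NCCL uses a single NIC.
     speed="100000" (100G): NCCL needs both ports to reach full PCIe bandwidth. -->
<pci busid="0000:08:00.0">
  <nic>
    <net name="mlx5_0" dev="0" speed="100000" port="1" guid="0x1" maxconn="131072" gdr="1"/>
  </nic>
</pci>
```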
Topo3 seems to have bad performance because GPU Direct RDMA is not detected as present. The same appears to happen with NCCL 2.20 and 2.21 as well, causing bad performance in all cases.
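One thing worth checking in that case (an assumption on my side, since the topo3 file isn't reproduced here) is whether the injected `<net>` entries still carry `gdr="1"`, and whether the peer-memory module (`nvidia_peermem`) is loaded inside the guest:

```xml
<!-- gdr="1" marks the NIC as GPU Direct RDMA capable; gdr="0", or a missing
     nvidia_peermem module in the VM, forces transfers to stage through host memory. -->
<net name="mlx5_0" dev="0" speed="200000" port="1" guid="0x1" maxconn="131072" gdr="1"/>
```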
Topo4 seems to just change the `port=` definition of the NICs, which appears to cause issues. I'm not sure what you were trying to do with the port definition, though, given that those `port=X` definitions work together with the `guid=X` definitions, which are not shown here. I'd need the topology dump, or the log with `NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH`, to see how NCCL will construct the node topology graph.
My first question would be: what is the real speed of each port? You mentioned it's a dual-port NIC; is each port running at 100G or 200G? You should reflect the reality in the topo file. You should also set `guid` to a different value for each port, and always set `port="1"`. The port/guid definition is not useful for your use case; it is there for the case of a NIC with multiple PCI attachments.
In recent versions, NCCL may try to fuse ports together (2x100G as 1x200G), but that is based on the topology detection inside the network plugin, and the topo injection can't affect that fusion. So I'm not sure that would be possible unless the PCI IDs show each port as a PCI subdevice (e.g. `0000:08:02.0` and `0000:08:02.1`). In other words, you'd want to force KVM to assign specific PCI IDs to enable port fusion.
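If the guest is managed through libvirt, one way to pin the guest-side addresses is a `<hostdev>` entry per VF with an explicit guest `<address>`, so the two ports of one NIC appear as functions .0 and .1 of the same guest device. This is only a rough sketch; the host and guest addresses below are placeholders and would need to match the actual VFs:

```xml
<!-- Two VFs of one physical CX7, pinned to guest addresses 0000:08:02.0 and 0000:08:02.1. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- host PCI address of the first VF (placeholder) -->
    <address domain='0x0000' bus='0xb1' slot='0x00' function='0x2'/>
  </source>
  <!-- guest PCI address: function 0 of a multifunction device -->
  <address type='pci' domain='0x0000' bus='0x08' slot='0x02' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- host PCI address of the second VF (placeholder) -->
    <address domain='0x0000' bus='0xb1' slot='0x00' function='0x3'/>
  </source>
  <!-- guest PCI address: function 1 of the same device -->
  <address type='pci' domain='0x0000' bus='0x08' slot='0x02' function='0x1'/>
</hostdev>
```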
Hi, we are seeing unexpected All Reduce performance with GPU Direct + RoCE within KVM+QEMU VMs. The 8x H100 GPUs and the 8x dual-port CX7-based SR-IOV VFs (2x8 VFs in total) are passed to the VM using VFIO -- we ensure that VFs from the same NIC are presented as separate ports and that devices are attached to the appropriate NUMA domains inside the NCCL topology file. We tried different versions of NCCL and modified the NCCL topology file. The best performance was achieved with NCCL 2.19.3 and when presenting each SR-IOV VF as 100G-capable instead of the original 200G (see attached `topo1.xml`).

Attaching the `lspci` output, `nvidia-smi` output, NCCL topology files, NCCL graph files, and the NCCL log file for each of the above runs:

logs-nccl-2.19.3-1-topo1.txt
logs-nccl-2.19.3-1-topo2.txt
logs-nccl-2.19.3-1-topo3.txt
logs-nccl-2.19.3-1-topo4.txt
logs-nccl-2.20.5-1-topo1.txt
logs-nccl-2.20.5-1-topo2.txt
logs-nccl-2.20.5-1-topo3.txt
logs-nccl-2.20.5-1-topo4.txt
logs-nccl-2.21.5-1-topo1.txt
logs-nccl-2.21.5-1-topo2.txt
logs-nccl-2.21.5-1-topo3.txt
logs-nccl-2.21.5-1-topo4.txt
lspci-vt.txt
nvidia-smi.txt
graph-nccl-2.19.3-1-topo1.txt
graph-nccl-2.19.3-1-topo2.txt
graph-nccl-2.19.3-1-topo3.txt
graph-nccl-2.19.3-1-topo4.txt
graph-nccl-2.20.5-1-topo1.txt
graph-nccl-2.20.5-1-topo2.txt
graph-nccl-2.20.5-1-topo3.txt
graph-nccl-2.20.5-1-topo4.txt
graph-nccl-2.21.5-1-topo1.txt
graph-nccl-2.21.5-1-topo2.txt
graph-nccl-2.21.5-1-topo3.txt
graph-nccl-2.21.5-1-topo4.txt
topo1.txt
topo2.txt
topo3.txt
topo4.txt