NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

All Reduce Performance on H100 VMs #1303

Open apoorvemohan opened 4 months ago

apoorvemohan commented 4 months ago

Hi, we are seeing unexpected All Reduce performance with GPU Direct + RoCE inside KVM+QEMU VMs. The 8 H100 GPUs and the SR-IOV VFs from the 8 dual-port CX7 NICs (2x8 = 16 VFs in total) are passed to the VM using VFIO. We make sure that VFs from the same NIC are presented as separate ports and that devices are attached to the appropriate NUMA domains inside the NCCL topology file. We tried different NCCL versions and modified the NCCL topology file. The best performance was achieved with NCCL 2.19.3 when presenting each SR-IOV VF as 100G-capable instead of the original 200G (see attached topo1.xml).
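For reference, the change is made on the `<net>` node of the injected topology. The snippet below is only an illustrative sketch (placeholder bus ID and GUID, attribute names as they appear in NCCL topology dumps), not the actual topo1.xml:

```xml
<!-- Illustrative sketch only: placeholder busid/guid values.
     speed is in Mbit/s, so "100000" presents the VF as a 100G port
     (presenting it at its original 200G would be speed="200000"). -->
<pci busid="0000:08:00.0" class="0x020000" link_speed="16.0 GT/s PCIe" link_width="16">
  <nic>
    <net name="mlx5_0" dev="0" speed="100000" port="1" guid="0x1" gdr="1"/>
  </nic>
</pci>
```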

| NCCL Version | NCCL Topology | Peak Performance |
|--------------|---------------|------------------|
| 2.19.3       | topo1         | ~365 GB/s        |
| 2.19.3       | topo2         | ~193 GB/s        |
| 2.19.3       | topo3         | ~51 GB/s         |
| 2.19.3       | topo4         | internal error   |
| 2.20.5       | topo1         | ~41 GB/s         |
| 2.20.5       | topo2         | ~41 GB/s         |
| 2.20.5       | topo3         | ~41 GB/s         |
| 2.20.5       | topo4         | ~41 GB/s         |
| 2.21.5       | topo1         | ~40 GB/s         |
| 2.21.5       | topo2         | ~40 GB/s         |
| 2.21.5       | topo3         | ~40 GB/s         |
| 2.21.5       | topo4         | ~40 GB/s         |

Attaching the lspci output, nvidia-smi output, NCCL topology files, NCCL graph files, and the NCCL log file for each of the above runs.

A few questions/observations from the graph files:
- The graph file format appears to have changed between NCCL versions 2.20.5 and 2.21.5 (e.g., device IDs switched from decimal to hexadecimal). Do we need to change the NCCL topology file format as well?

- For NCCL 2.19.3, why does the number of channels drop from 16 to 8 when moving from topo1.txt to topo2.txt? Can you point us to the NCCL code location where the number of channels to use is calculated? Also, why does reducing the number of channels (from 16 to 8) as the link speed is increased (from 100 to 200) result in lower performance?

- Note that most of the attached NCCL topology files use a minimal topology in which we do not specify the NV bridges (only the GPUs and SR-IOV VFs). Will that impact performance with newer NCCL versions?

Logs: logs-nccl-2.19.3-1-topo1.txt, logs-nccl-2.19.3-1-topo2.txt, logs-nccl-2.19.3-1-topo3.txt, logs-nccl-2.19.3-1-topo4.txt, logs-nccl-2.20.5-1-topo1.txt, logs-nccl-2.20.5-1-topo2.txt, logs-nccl-2.20.5-1-topo3.txt, logs-nccl-2.20.5-1-topo4.txt, logs-nccl-2.21.5-1-topo1.txt, logs-nccl-2.21.5-1-topo2.txt, logs-nccl-2.21.5-1-topo3.txt, logs-nccl-2.21.5-1-topo4.txt

Graph files: graph-nccl-2.19.3-1-topo1.txt, graph-nccl-2.19.3-1-topo2.txt, graph-nccl-2.19.3-1-topo3.txt, graph-nccl-2.19.3-1-topo4.txt, graph-nccl-2.20.5-1-topo1.txt, graph-nccl-2.20.5-1-topo2.txt, graph-nccl-2.20.5-1-topo3.txt, graph-nccl-2.20.5-1-topo4.txt, graph-nccl-2.21.5-1-topo1.txt, graph-nccl-2.21.5-1-topo2.txt, graph-nccl-2.21.5-1-topo3.txt, graph-nccl-2.21.5-1-topo4.txt

Topology files: topo1.txt, topo2.txt, topo3.txt, topo4.txt

System info: lspci-vt.txt, nvidia-smi.txt

sjeaugey commented 4 months ago

Re. topo1 vs topo2: 200 Gbps matches the bandwidth of PCIe Gen4 x16 (16 GT/s × 16 lanes is roughly 31.5 GB/s raw, about 25 GB/s in practice, i.e. around 200 Gb/s), so NCCL will only use one NIC in that situation. Reducing the advertised speed to 100 Gbps therefore makes NCCL use both interfaces instead of just one. With two ports, NCCL also needs to use 2x the number of SMs, which may bring extra performance in benchmarks but may also significantly degrade the application's performance, since more SMs are used for NCCL rather than for compute.

Topo3 seems to have bad performance because GPU Direct RDMA is not detected as present. The same seems to happen with NCCL 2.20 and 2.21 as well, causing bad performance in all cases.
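One quick way to confirm (sketch only; the name and values below are placeholders, with attributes as they appear in a topology dumped via NCCL_TOPO_DUMP_FILE) is to look at the gdr attribute on the `<net>` node:

```xml
<!-- Sketch of a dumped <net> node: gdr="1" means GPU Direct RDMA was detected;
     gdr="0" would be consistent with the ~40-50 GB/s results above. -->
<net name="mlx5_0" dev="0" speed="200000" port="1" guid="0x1" gdr="1"/>
```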

Topo4 seems to just change the port= definition of the NICs, which appears to cause issues. I'm not sure what you were trying to do with the port definition though, given that those port=X definitions work together with the guid=X definition, which is not shown here. I'd need the topology dump, or the log with NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH, to see how NCCL constructs the node topology graph.

My first question would be: what is the real speed of each port? You mentioned it's a dual-port NIC; is each port running at 100G or 200G? You should reflect the reality in the topo file. You should also set guid to a different value for each port, and always set port="1". The port/guid definition is not useful for your use case; it is there for the case of a NIC with multiple PCI attachments.
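For example (sketch only; bus IDs and GUIDs are placeholders, and speed should be set to whatever each port really runs at), each VF would get its own `<net>` entry with a distinct guid and port="1":

```xml
<!-- One <net> per VF: distinct guid, port="1", speed set to the real port speed in Mbit/s -->
<pci busid="0000:08:00.0" class="0x020000" link_speed="16.0 GT/s PCIe" link_width="16">
  <nic>
    <net name="mlx5_0" dev="0" speed="200000" port="1" guid="0x1" gdr="1"/>
  </nic>
</pci>
<pci busid="0000:09:00.0" class="0x020000" link_speed="16.0 GT/s PCIe" link_width="16">
  <nic>
    <net name="mlx5_1" dev="1" speed="200000" port="1" guid="0x2" gdr="1"/>
  </nic>
</pci>
```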

In recent versions, NCCL may try to fuse ports together (2x100G as 1x200G), but that is based on the topology detection inside the network plugin, and the topo injection can't affect that fusion. So I'm not sure it would be possible unless the PCI IDs show each port as a PCI subdevice (e.g. 0000:08:02.0 and 0000:08:02.1). In other words, you'd want to force KVM to assign specific PCI IDs to enable port fusion.
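If libvirt is managing the VM (an assumption on my side), one way to try that is to pin the two VFs of one NIC to the same guest slot as functions 0x0 and 0x1. This is only a sketch: the host source addresses and the guest bus/slot numbers are placeholders, and the exact layout you can use depends on the machine type and PCI controllers.

```xml
<!-- Sketch: both VFs of one NIC placed at guest 0000:08:02, functions .0 and .1 -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- host address of the first VF (placeholder) -->
    <address domain='0x0000' bus='0x08' slot='0x00' function='0x2'/>
  </source>
  <!-- guest address: function 0 of a multifunction slot -->
  <address type='pci' domain='0x0000' bus='0x08' slot='0x02' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- host address of the second VF (placeholder) -->
    <address domain='0x0000' bus='0x08' slot='0x00' function='0x3'/>
  </source>
  <!-- guest address: function 1 of the same slot -->
  <address type='pci' domain='0x0000' bus='0x08' slot='0x02' function='0x1'/>
</hostdev>
```

Even with the guest IDs arranged that way, whether the ports actually get fused still depends on the detection inside the network plugin, as noted above.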