NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL GPU affinity (nvidia-smi topo -m) on a VM: PIX/PHB/PXB GDRDMA fails, but performance is good on BM #1464

Open dobiup opened 1 month ago

dobiup commented 1 month ago

Hi, I seriously need your help.

First, the NVIDIA HW/SW components for intra-node and inter-node communication are all installed, compatible, and enabled: NVSwitch (NVLink, Fabric Manager), GPU (GPU driver, CUDA, NCCL), HCA (OFED, nv_peer_mem or nvidia_peermem), and so on.

The stack is NCCL 2.23.4 for CUDA 12.2 on the Proxmox hypervisor, Ubuntu 22.04.5, Data Center GPU driver 535.183.06, Mellanox OFED 23.10-3.2.2, and nv_peer_mem 535.183.06 (bundled with the GPU driver, and also installed manually).

Plain InfiniBand RDMA performance (without GPU Direct RDMA) is also good, matching the bare-metal environment according to the ib_send/ib_receive bandwidth and latency results. The problem is that efficient GPU Direct RDMA (PIX, PXB, PHB) does not work in NCCL because of the intra-node topology the VM exposes, as explained below. As you know, GPU Direct RDMA only helps when the HCA can reach GPU memory through a PCIe switch.
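
(GPU Direct RDMA itself can also be checked outside NCCL with a CUDA-enabled perftest build; a rough sketch, where the device name, GPU index and hostname are only examples:)

# Server side: register the test buffer in GPU 0 memory instead of host memory
ib_write_bw -d mlx5_0 --use_cuda=0
# Client side
ib_write_bw -d mlx5_0 --use_cuda=0 <server_hostname>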

NCCL log (as you can see, GPU Direct RDMA was enabled, but then disabled due to the topology differences): [Image: GDRDMA enabled] [Image: GDRDMA disabled]

So we also tested with the GPU Direct RDMA distance requirement relaxed to the system (CPU) level, which gives no significant performance advantage, by setting NCCL_NET_GDR_LEVEL=SYS: [Image: NCCL log] [Image: GPU Direct RDMA via system memory copy]
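
(A minimal sketch of how that variable can be passed through mpirun, in the same style as the full command further below; the hostfile and binary paths are just from this setup:)

# Keep GPU Direct RDMA enabled even when NCCL sees the NIC and GPU at SYS distance
mpirun -np 16 -N 8 --hostfile /home/singtel/nccl-tests/hosts-2 \
  -x NCCL_NET_GDR_LEVEL=SYS \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET \
  /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1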

Ultimately, what we want most is the same intra-node topology (nvidia-smi topo -m) as on bare metal. But, with NVIDIA's help, we would first like to make GPU Direct RDMA work at the PIX, PXB, and PHB levels.

The key question is how to configure the VM environment (GPU affinity together with CPU affinity) so that the topology allows GPU Direct RDMA at the PIX, PXB, and PHB levels.

[Image: VM intra-node topology (nvidia-smi topo -m)]

[Image: BM intra-node topology (nvidia-smi topo -m)]

mpirun \
  -np 16 \
  -N 8 \
  --bind-to socket \
  --hostfile /home/singtel/nccl-tests/hosts-2 \
  -x NCCL_IB_CUDA_SUPPORT=1 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x NCCL_UCX_TLS=rc_x,cuda_copy \
  -x NCCL_UCX_RNDV_THRESH=0 \
  -x UCX_MEMTYPE_CACHE=n \
  -x NCCL_COLLNET_ENABLE=0 \
  -x NCCL_PLUGIN_P2P=ucx \
  -x NCCL_DEBUG=info \
  -x NCCL_DEBUG_SUBSYS=NET \
  -x NCCL_IB_HCA=mlx5 \
  /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1

[Images: all_reduce_perf results]

sjeaugey commented 1 month ago

So, are you asking how to inject a topology inside NCCL?

Would that comment help: https://github.com/NVIDIA/nccl-tests/issues/86#issuecomment-848939744

dobiup commented 1 month ago

Thank you for the update.

Yes, that is exactly what I want. Where can I get the NCCL_TOPO_FILE? Can only the VM provider (Proxmox or KVM) provide it?

dobiup commented 1 month ago

Hi, SJ

Where can I find the NCCL topology XML file? I couldn't find it at the location described below:

NCCL_TOPO_FILE (since 2.6) Path to an XML file to load before detecting the topology. By default, NCCL will load /var/run/nvidia-topologyd/virtualTopology.xml if present.

// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
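
If I have to supply the file myself, I assume I can also point NCCL at a custom path explicitly, something like this (the path is just an example):

# Load a hand-written topology file instead of the default location
export NCCL_TOPO_FILE=/path/to/virtualTopology.xml
# or, through mpirun: -x NCCL_TOPO_FILE=/path/to/virtualTopology.xml
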
sjeaugey commented 1 month ago

If your cloud provider doesn't provide one (or if you are launching your VM yourself), you'd need to write it, based on the physical topology.
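
For illustration only, a hand-written file for a single GPU/HCA pair sitting under a common PCIe switch could look roughly like the sketch below. Every bus ID, class code, affinity and link attribute is a placeholder and must be replaced with the values of the physical host, and NCCL may expect more attributes than shown here. The safest way to get a correct file is to run the same NCCL version on the bare-metal machine with NCCL_TOPO_DUMP_FILE pointing at a writable path, then trim the dumped XML down to the devices the VM actually sees.

# Write a skeleton topology file to the default location NCCL checks
# (all values below are placeholders, not a working configuration)
sudo mkdir -p /var/run/nvidia-topologyd
sudo tee /var/run/nvidia-topologyd/virtualTopology.xml >/dev/null <<'EOF'
<system version="1">
  <cpu numaid="0" affinity="0000ffff" arch="x86_64" vendor="GenuineIntel">
    <!-- PCIe switch shared by the GPU and the HCA -->
    <pci busid="0000:17:00.0" class="0x060400" link_speed="16.0 GT/s PCIe" link_width="16">
      <!-- GPU behind the switch -->
      <pci busid="0000:18:00.0" class="0x030200" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="0" sm="80" rank="0" gdr="1"/>
      </pci>
      <!-- HCA behind the same switch -->
      <pci busid="0000:19:00.0" class="0x020700" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_0" dev="0" speed="200000" port="1" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>
EOF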