NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Question: when SRIOV is enabled on DGX-like GPU servers, does GPUDirect work normally? #603

Open whisper-wind17 opened 2 years ago

whisper-wind17 commented 2 years ago

Question: when SRIOV is enabled on several DGX-like GPU servers, does GPUDirect, including GPUDirect P2P and GPUDirect RDMA, work normally?

Background: In a Kubernetes cluster, every GPU server has a 1 Gb Ethernet NIC and a 100 Gb Mellanox CX5 NIC. All DGX-like GPU servers are interconnected via both an Ethernet network and a RoCE network. The RoCE network carries the communication between workers in a distributed training job. When P2P is enabled (NCCL_P2P_DISABLE=0), training jobs sometimes hang, but when P2P is disabled (NCCL_P2P_DISABLE=1), training jobs run normally. I don't know why. Does GPUDirect work normally when SRIOV is enabled?
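One way to narrow down a hang like this is to run the same job twice, once with P2P enabled and once with it disabled, with NCCL debug logging turned on so the two logs can be compared. Below is a minimal Python sketch of that A/B run. `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are NCCL environment variables; the helper names, the timeout, and the `train.py` command in the usage comment are hypothetical placeholders, not taken from this thread.

```python
import os
import subprocess


def nccl_env(p2p_disabled, base_env=None):
    """Return a copy of the environment with the NCCL settings for one run."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1" if p2p_disabled else "0"
    env["NCCL_DEBUG"] = "INFO"  # makes NCCL log which transports it selects
    return env


def run_job(cmd, p2p_disabled, timeout_s):
    """Run cmd once; return its exit code, or None if it exceeds timeout_s."""
    try:
        result = subprocess.run(cmd, env=nccl_env(p2p_disabled), timeout=timeout_s)
        return result.returncode
    except subprocess.TimeoutExpired:
        return None  # treated here as a hang


# Usage (hypothetical launch command and timeout):
#   for disabled in (False, True):
#       print(disabled, run_job(["python", "train.py"], disabled, timeout_s=600))
```

A `None` result only with `p2p_disabled=False` would reproduce the behaviour described above and point at the P2P path rather than at the training code.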

Thanks a lot for your time.

sjeaugey commented 2 years ago

I'm not a PCI expert, but here is my understanding ...

When VT-d/ACS is enabled, PCI-to-PCI transfers (which all GPUDirect technologies rely on) are rerouted to the CPU PCI Root Complex (RC). That can reduce performance in some cases, but more importantly, if no VM system (e.g. KVM) has configured the RC, the RC might not route those packets properly, which causes hangs or crashes. Arguably the Linux kernel should always configure the routing to keep things functional when it detects that ACS is enabled, but unfortunately that is not the case.

That's why on bare-metal systems, we always advise disabling VT-d/ACS.
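To see whether ACS is active on a given machine, the ACS control bits of each PCI bridge can be read from `lspci -vvv` output (run as root so the capabilities are visible). The sketch below parses that text and reports devices whose `ACSCtl` line shows `SrcValid+`. Treating `SrcValid+` as "ACS is probably enabled" is a heuristic, and the function name is hypothetical.

```python
import re

# Device header lines in `lspci -vvv` start at column 0 with a PCI address,
# e.g. "00:01.0" or "0000:00:01.0"; capability lines below them are indented.
_DEVICE_RE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)


def devices_with_acs_enabled(lspci_text):
    """Return PCI addresses whose ACSCtl line shows SrcValid+ (heuristic)."""
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        match = _DEVICE_RE.match(line)
        if match:
            current = match.group(1)
        elif current is not None and "ACSCtl:" in line and "SrcValid+" in line:
            enabled.append(current)
    return enabled


# Usage (needs root and the pciutils package):
#   import subprocess
#   text = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
#   print(devices_with_acs_enabled(text))
```

An empty list suggests ACS is not enforcing redirection on any bridge; a non-empty list names the bridges worth looking at first.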

I'm not sure how SRIOV is related to VT-d, whether SRIOV somehow implies VT-d or whether they are entirely orthogonal. But usually when using SRIOV, the goal is to use virtual machines, hence to enable VT-d, hence to load a VM system like KVM, which should make GPUDirect functional and should not cause crashes or hangs.

lileidev commented 2 years ago

I think both VT-d (or AMD IOMMU) and SR-IOV are I/O virtualization technologies. VT-d allows one physical device to be passed through to one guest OS, and SR-IOV allows one physical device to be split into multiple virtual devices that are passed to multiple guest OSes. As far as I know, ACS should be disabled to avoid data being routed to the CPU Root Complex on a system with a PCIe switch. But why is VT-d also related to this?

apoorvemohan commented 1 year ago

NOTE: Some vendors do not provide an explicit BIOS option to disable ACS while VT-d is enabled, so on those systems ACS can only be disabled by disabling VT-d. Even disabling ACS from the OS with the setpci command might not work.
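For reference, the OS-side attempt mentioned in this note usually means clearing the ACS Control register of each bridge with `setpci`. The sketch below only builds that command line; it does not run it. `ECAP_ACS` is setpci's name for the ACS extended capability and the 16-bit control register sits at offset 0x6 within it. The helper name is hypothetical, the command needs root, the setting does not survive a reboot, and, as the note says, it may have no effect on some platforms, so the `ACSCtl` line in `lspci -vvv` should be re-checked afterwards.

```python
def acs_disable_command(bdf):
    """Build the setpci invocation that zeroes the ACS Control register
    of the device at PCI address `bdf` (e.g. "00:01.0")."""
    # ECAP_ACS+0x6.w addresses the 16-bit ACS Control register;
    # writing 0000 clears every ACS control bit on that device.
    return ["setpci", "-v", "-s", bdf, "ECAP_ACS+0x6.w=0000"]


# Usage (requires root; shown commented out because it modifies PCI config space):
#   import subprocess
#   subprocess.run(acs_disable_command("00:01.0"), check=True)
```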