NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why can't two GPUs in a virtual machine communicate using P2P? #1329

Open qianxiaoliang opened 6 days ago

qianxiaoliang commented 6 days ago

I created a virtual machine on a bare-metal server with 8 A800 GPU cards. The virtual machine has 4 GPU cards attached. The XML topo file exported using the environment variable NCCL_TOPO_DUMP_FILE is as follows:

<system version="1">
  <cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="134">
    <pci busid="0000:00:09.0" class="0x030200" vendor="0x10de" device="0x20f5" subsystem_vendor="0x10de" subsystem_device="0x1799" link_speed="" link_width="0">
      <gpu dev="0" sm="80" rank="0" gdr="1"/>
    </pci>
    <pci busid="0000:00:0a.0" class="0x030200" vendor="0x10de" device="0x20f5" subsystem_vendor="0x10de" subsystem_device="0x1799" link_speed="" link_width="0">
      <gpu dev="1" sm="80" rank="1" gdr="1"/>
    </pci>
    <pci busid="0000:00:0b.0" class="0x030200" vendor="0x10de" device="0x20f5" subsystem_vendor="0x10de" subsystem_device="0x1799" link_speed="" link_width="0">
      <gpu dev="2" sm="80" rank="2" gdr="1"/>
    </pci>
    <pci busid="0000:00:0c.0" class="0x030200" vendor="0x10de" device="0x20f5" subsystem_vendor="0x10de" subsystem_device="0x1799" link_speed="" link_width="0">
      <gpu dev="3" sm="80" rank="3" gdr="1"/>
    </pci>
    <nic>
      <net name="ens4" dev="0" speed="10000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
</system>
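
For readers who want to inspect such a dump programmatically, here is a minimal Python sketch that lists the GPUs found in the topology XML. The helper name `summarize_gpus` is hypothetical and not part of NCCL. The sketch also assumes that NVLink connections, when NCCL detects them, would appear as `nvlink` child elements under each `gpu` element; that assumption should be checked against a dump from a machine where NVLink is known to work.

```python
import xml.etree.ElementTree as ET


def summarize_gpus(xml_text):
    """Return one dict per <gpu> element found in an NCCL topology dump.

    Hypothetical helper for illustration only; not part of NCCL.
    """
    root = ET.fromstring(xml_text)
    rows = []
    # root.iter("pci") also visits nested <pci> nodes (e.g. PCI switches).
    for pci in root.iter("pci"):
        for gpu in pci.findall("gpu"):
            rows.append({
                "busid": pci.get("busid"),
                "dev": int(gpu.get("dev")),
                "rank": int(gpu.get("rank")),
                # An empty or missing link_width is treated as 0.
                "link_width": int(pci.get("link_width") or 0),
                # Assumption: NVLink connections show up as <nvlink> children.
                "nvlinks": len(gpu.findall("nvlink")),
            })
    return rows
```

Applied to the dump above, this would return four entries, each with `link_width` 0 and no `nvlink` children, since every `gpu` element is self-closing.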

In the virtual machine, I ran the command: all_reduce_perf -b 16m -e 256m -f 2 -g 2. GPUs 0 and 1 are attached to the same NUMA node on the bare-metal server. Inside the virtual machine, nvidia-smi topo -m reports their relationship as PHB, and the two GPUs communicate via SHM/direct/direct instead of P2P. The log shows the following message: "P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1."

sjeaugey commented 6 days ago

That message means that CUDA refused to enable P2P between the two GPUs. This is something that NCCL has no control over. You should be able to reproduce it outside of NCCL by running basic CUDA P2P examples.

Then you can ask NVIDIA support how to solve that problem (e.g. through NVIDIA Developer).
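
One way to check this outside of NCCL is to query the CUDA runtime directly. The sketch below is illustrative only: it calls the CUDA runtime functions `cudaGetDeviceCount` and `cudaDeviceCanAccessPeer` through Python's `ctypes`. The helper names `load_cudart` and `peer_access_matrix` are hypothetical, and the code assumes the CUDA runtime shared library can be found on the system as `cudart` or `libcudart.so`.

```python
import ctypes
import ctypes.util
from itertools import permutations


def load_cudart():
    """Load the CUDA runtime library (assumed to be installed and discoverable)."""
    name = ctypes.util.find_library("cudart") or "libcudart.so"
    return ctypes.CDLL(name)


def peer_access_matrix(cudart, device_count=None):
    """Return {(device, peer): bool} as reported by cudaDeviceCanAccessPeer.

    `cudart` is a handle to the CUDA runtime, e.g. from load_cudart().
    """
    if device_count is None:
        count = ctypes.c_int(0)
        err = cudart.cudaGetDeviceCount(ctypes.pointer(count))
        if err != 0:
            raise RuntimeError(f"cudaGetDeviceCount failed with error {err}")
        device_count = count.value

    result = {}
    # Query every ordered pair, since peer access is reported per direction.
    for dev, peer in permutations(range(device_count), 2):
        can_access = ctypes.c_int(0)
        err = cudart.cudaDeviceCanAccessPeer(ctypes.pointer(can_access), dev, peer)
        if err != 0:
            raise RuntimeError(
                f"cudaDeviceCanAccessPeer({dev}, {peer}) failed with error {err}"
            )
        result[(dev, peer)] = bool(can_access.value)
    return result
```

Calling `peer_access_matrix(load_cudart())` inside the virtual machine would show, for each ordered GPU pair, whether CUDA reports peer access as possible. If the pairs (0, 1) and (1, 0) come back `False`, the limitation sits below NCCL, which matches the explanation above and gives a small reproducer to share with NVIDIA support.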