Open qianxiaoliang opened 6 days ago
That message means that CUDA refused to enable P2P between the two GPUs. NCCL has no control over this. You should be able to reproduce it outside of NCCL by running basic CUDA P2P examples.
You can then ask NVIDIA support how to solve the problem (e.g. through NVIDIA Developer).
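For reference, a minimal sketch of such a standalone check is below. It only uses the CUDA runtime calls cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess; the file name p2p_check.cu and the output format are illustrative choices, not something taken from NCCL. The simpleP2P and p2pBandwidthLatencyTest programs in the CUDA samples cover the same ground in more detail.

```cuda
// p2p_check.cu (hypothetical file name): report whether CUDA allows P2P
// between every ordered pair of visible GPUs.
// Build (assuming nvcc is on PATH): nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            return 1;                                            \
        }                                                        \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    printf("Visible GPUs: %d\n", count);

    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;

            // Ask the driver whether device i may map device j's memory.
            int can_access = 0;
            CHECK(cudaDeviceCanAccessPeer(&can_access, i, j));
            printf("GPU %d -> GPU %d: canAccessPeer = %d\n", i, j, can_access);
            if (!can_access) continue;

            // Try to actually enable peer access from i to j.
            CHECK(cudaSetDevice(i));
            cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
            if (err == cudaSuccess) {
                printf("  enabling peer access succeeded\n");
                CHECK(cudaDeviceDisablePeerAccess(j));
            } else {
                printf("  enabling peer access failed: %s\n",
                       cudaGetErrorString(err));
                cudaGetLastError();  // clear the error before the next pair
            }
        }
    }
    return 0;
}
```

If this reports canAccessPeer = 0 for GPUs 0 and 1 inside the virtual machine, the restriction comes from the CUDA driver or the virtualization setup rather than from NCCL.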
I created a virtual machine on a bare-metal server with 8 A800 GPUs. The virtual machine has 4 of the GPUs attached. The XML topology file exported using the environment variable NCCL_TOPO_DUMP_FILE is as follows:
In the virtual machine, I ran the command: all_reduce_perf -b 16m -e 256m -f 2 -g 2. GPUs 0 and 1 belong to the same NUMA node on the bare-metal server. In the virtual machine, nvidia-smi topo -m reports their relationship as PHB, and communication between the two GPUs goes through SHM/direct/direct instead of P2P. The log shows the following output: "P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1."