NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Enable P2P over PCIe on an NVLink machine #250

Open cll24 opened 1 month ago

cll24 commented 1 month ago

Hi, I want to test all_reduce_perf with P2P over PCIe on an H20. However, the H20 is equipped with NVLink, so NCCL all_reduce_perf always transfers data over NVLink. How can I use P2P over PCIe and disable NVLink in the test?

I tried to disable NVLink with RMNvLinkEnable=0x0. With that setting, NCCL all_reduce_perf always uses SHM to communicate.

kiskra-nvidia commented 1 month ago

To the best of my knowledge, there's no way for NCCL to disable just NVLink. The granularity of control is "P2P" or "no P2P".
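For illustration, the "P2P" versus "no P2P" switch can be exercised with NCCL's `NCCL_P2P_DISABLE` environment variable, with `NCCL_DEBUG=INFO` set so the log shows which transport was picked. The sketch below is a minimal example and not part of the original reply: the helper names, the `./build/all_reduce_perf` path, and the GPU count of 8 are assumptions to adjust for the local build.

```python
import os
import subprocess


def nccl_env(disable_p2p, base=None):
    """Return a copy of the environment with NCCL's P2P switch set.

    `base` defaults to the current process environment; pass a dict to
    build the environment from scratch (useful for inspection).
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1" if disable_p2p else "0"
    # INFO-level logging makes NCCL report the transport it selects.
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_all_reduce(disable_p2p, binary="./build/all_reduce_perf", gpus=8):
    """Run one all_reduce_perf sweep (binary path and GPU count are assumed)."""
    cmd = [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(gpus)]
    return subprocess.run(cmd, env=nccl_env(disable_p2p), check=True)


if __name__ == "__main__":
    # Show the two settings that differ between a P2P and a no-P2P run.
    for flag in (False, True):
        env = nccl_env(flag, base={})
        print(env["NCCL_P2P_DISABLE"], env["NCCL_DEBUG"])
```

Calling `run_all_reduce(False)` and then `run_all_reduce(True)` and comparing the transport lines in the two logs is one way to confirm whether P2P is actually in use.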

What does nvidia-smi topo -m print after you use RMNvLinkEnable? Perhaps the GPUs are simply too far from each other on the PCIe bus? NCCL will typically not attempt P2P if devices are any further from each other than PXB.
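The distance check suggested above can be sketched as a small parser over the `nvidia-smi topo -m` matrix that lists GPU pairs farther apart than PXB. This is only an illustration under stated assumptions: the sample matrix is hypothetical (it is not output from the reporter's H20 machine), the helper names are made up here, and the ordering of labels follows the legend that `nvidia-smi` prints (PIX closest, then PXB, PHB, NODE, SYS).

```python
import re

# Topology labels ranked closest first, following the nvidia-smi legend.
RANK = {"X": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

# Hypothetical sample; real output comes from `nvidia-smi topo -m`.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PIX     PXB     SYS     0-31    0
GPU1    PIX      X      PHB     SYS     0-31    0
GPU2    PXB     PHB      X      NODE    32-63   1
GPU3    SYS     SYS     NODE     X      32-63   1
"""


def parse_topo(text):
    """Return (gpu_names, {(gpu_a, gpu_b): label}) from the matrix text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Only header cells like GPU0, GPU1 are GPU columns (assumed to come first).
    gpus = [h for h in lines[0].split() if re.fullmatch(r"GPU\d+", h)]
    matrix = {}
    for ln in lines[1:]:
        cells = ln.split()
        if cells[0] not in gpus:
            continue  # skip NIC rows and legend lines
        for peer, label in zip(gpus, cells[1:1 + len(gpus)]):
            matrix[(cells[0], peer)] = label
    return gpus, matrix


def label_rank(label):
    """NVLink entries (NV1, NV18, ...) count as closer than any PCIe path."""
    return 0 if label.startswith("NV") else RANK[label]


def pairs_beyond(gpus, matrix, limit="PXB"):
    """List (gpu_a, gpu_b, label) for pairs farther apart than `limit`."""
    far = []
    for i, a in enumerate(gpus):
        for b in gpus[i + 1:]:
            if label_rank(matrix[(a, b)]) > RANK[limit]:
                far.append((a, b, matrix[(a, b)]))
    return far


if __name__ == "__main__":
    names, topo = parse_topo(SAMPLE)
    for a, b, label in pairs_beyond(names, topo):
        print(f"{a} <-> {b}: {label} (farther than PXB)")
```

On a real machine, the text passed to `parse_topo` could be captured with `subprocess.check_output(["nvidia-smi", "topo", "-m"], text=True)`. Any pair reported here is a candidate for falling back to SHM once NVLink is disabled.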