NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL_NET_GDR_READ's performance impact on a PCIe platform #1295

Open cold2stone opened 2 months ago

cold2stone commented 2 months ago

Hello,

NVIDIA's official documentation mentions that NCCL_NET_GDR_READ is set to 1 by default only on NVLink-based platforms. Additionally, it notes, "Reading directly from GPU memory when sending data is known to be slightly slower than reading from CPU memory on some platforms, such as PCI-E." Indeed, my experiments on a PCIe platform show better performance with NCCL_NET_GDR_READ=0.

My question is this: even on NVLink-based platforms (e.g. DGX), the RNIC and the GPU are connected via PCIe, not NVLink. So why is there a performance difference compared to PCIe platforms, given that the RNIC and GPU are not connected via NVLink in either case? Isn't NVLink involved only in data transfer between GPUs?

Additionally, this leads to a broader question: what exactly makes GDR perform better than not using GDR? I suspect that the difference in GPU memory read performance between these two platforms is more about latency than bandwidth. I also believe that p2p communication does not affect the PCIe data transfer bandwidth between devices. Does the improvement in collective communication bandwidth brought by GDR therefore rely solely on the reduced communication latency from p2p?
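For reference, here is a minimal sketch of the kind of timing loop behind the comparison above (simplified and illustrative, not from nccl-tests). It assumes an MPI launcher and 8 GPUs per node; NCCL_NET_GDR_READ=0 or =1 is exported before launch, so NCCL sees it when ncclCommInitRank sets up the network transports.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  fprintf(stderr, "NCCL error %s:%d: %s\n", __FILE__, __LINE__, ncclGetErrorString(r)); exit(1); } } while (0)

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // NCCL_NET_GDR_READ is exported by the launcher, so it is already in the
  // environment when the communicator (and its network transports) is created.
  CHECK_CUDA(cudaSetDevice(rank % 8));  // assumption: 8 GPUs per node

  ncclUniqueId id;
  if (rank == 0) CHECK_NCCL(ncclGetUniqueId(&id));
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  CHECK_NCCL(ncclCommInitRank(&comm, nranks, id, rank));

  const size_t count = 64 * 1024 * 1024;  // 64M floats = 256 MB per rank
  float* buf;
  CHECK_CUDA(cudaMalloc(&buf, count * sizeof(float)));
  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));

  // Warm-up, then a timed loop.
  for (int i = 0; i < 5; i++)
    CHECK_NCCL(ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream));
  CHECK_CUDA(cudaStreamSynchronize(stream));

  const int iters = 20;
  double t0 = MPI_Wtime();
  for (int i = 0; i < iters; i++)
    CHECK_NCCL(ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream));
  CHECK_CUDA(cudaStreamSynchronize(stream));
  double t = (MPI_Wtime() - t0) / iters;

  // Bus bandwidth convention for AllReduce: 2*(n-1)/n * bytes / time.
  double bytes = count * sizeof(float);
  if (rank == 0)
    printf("avg time %.3f ms, busbw %.1f GB/s\n", t * 1e3,
           2.0 * (nranks - 1) / nranks * bytes / t / 1e9);

  CHECK_CUDA(cudaFree(buf));
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```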

shanleo1986 commented 1 month ago

I have the same question: https://github.com/NVIDIA/nccl/issues/1181. I think NCCL_NET_GDR_READ does not control GDR itself; GDR is controlled by NCCL_NET_GDR_LEVEL, while NCCL_NET_GDR_READ only affects GDR on the sending side. Did you test whether NCCL_NET_GDR_READ=1 performs better than NCCL_NET_GDR_READ=0 on a DGX platform? I am also not sure whether NCCL_NET_GDR_READ is related to PXN or not.
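To make the distinction concrete, this is roughly how I think of the two variables (semantics as I described them above, not checked against the source; config_gdr is just an illustrative helper, and in practice you would export the variables in the job script rather than call setenv):

```c
#include <stdlib.h>

// Assumed semantics, following the discussion above:
//   NCCL_NET_GDR_LEVEL - how close the NIC and GPU must be for GDR to be used at all
//   NCCL_NET_GDR_READ  - whether the sending side reads directly from GPU memory
// Both need to be in the environment before the NCCL communicator is created.
static void config_gdr(int use_gdr, int send_side_gdr_read) {
  // "SYS" allows GDR regardless of NIC/GPU distance; "LOC" effectively disables it.
  setenv("NCCL_NET_GDR_LEVEL", use_gdr ? "SYS" : "LOC", 1);
  setenv("NCCL_NET_GDR_READ", send_side_gdr_read ? "1" : "0", 1);
}
```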

cold2stone commented 1 month ago

NCCL_NET_GDR_READ only determines whether the sending side uses GDR or not. I am not using a DGX platform.

My question is: even if a PCIe Gen5 platform does not use GDR, PCIe bandwidth will not be the bottleneck of the system. Given that the original advantage of GDR is to avoid the PCIe bottleneck near the CPU, I wonder why GDR makes a performance difference even though PCIe bandwidth is not the bottleneck.

My guess is that GDR itself does not increase the network bandwidth.
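A quick back-of-envelope check of why I think PCIe is not the bottleneck (illustrative, assumed numbers for a PCIe Gen5 x16 slot and a 400 Gb/s NIC, not measurements from my system):

```c
#include <stdio.h>

int main(void) {
  // Illustrative, assumed numbers.
  double pcie_gen5_x16 = 64.0;                  // GB/s, raw per-direction rate of a x16 Gen5 link
  double pcie_effective = 0.9 * pcie_gen5_x16;  // GB/s, assumed efficiency after protocol overhead
  double nic_400g = 400.0 / 8.0;                // GB/s, line rate of a 400 Gb/s NIC

  printf("PCIe Gen5 x16 (effective, assumed): %.1f GB/s\n", pcie_effective);
  printf("400G NIC line rate:                 %.1f GB/s\n", nic_400g);
  printf("PCIe headroom over the NIC:         %.1fx\n", pcie_effective / nic_400g);
  return 0;
}
```

So even without GDR, each PCIe hop still has headroom over the NIC, which is why I suspect the benefit of GDR comes mostly from removing the host-memory staging (latency and extra copies) rather than from raw link bandwidth.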

shanleo1986 commented 1 week ago

Do you have any understanding of NCCL_NET_GDR_READ=1? On my setup, setting NCCL_NET_GDR_READ=1 performs worse than NCCL_NET_GDR_READ=0 when running allreduce, allgather, and reduce-scatter, while several other tests perform better with NCCL_NET_GDR_READ=1. I cannot understand this; do you have any idea?