EdwardZhang88 opened 5 years ago
Even after I re-installed nv_peer_mem and the 'No module present for GPU Direct RDMA' message went away, performance still doesn't get any better in the GPU Direct RDMA case.
See this post: https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/
An RDMA transfer from the NIC to GPU memory using GPUDirect is slower than RDMA from the NIC to pinned CPU memory followed by a cudaMemcpy from CPU memory to GPU memory.
This is a PCIe peer-to-peer (P2P) issue.
In my setup (ConnectX-5, Quadro P6000, RoCEv2) I get 97.4 Gb/s with an intermediate step in CPU memory, versus 71 Gb/s with GPUDirect.
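For an apples-to-apples check outside NCCL, something like the following perftest invocations can be used. This is only a sketch: the device name mlx5_0 and the server address are placeholders, and the --use_cuda option requires a perftest build with CUDA support (its exact syntax varies by version).

```bash
# Baseline: RDMA write into pinned host memory
ib_write_bw -d mlx5_0 -a --report_gbits                 # server
ib_write_bw -d mlx5_0 -a --report_gbits <server_ip>     # client

# GPUDirect: RDMA write straight into GPU 0 memory
# (requires a CUDA-enabled perftest build and nv_peer_mem loaded)
ib_write_bw -d mlx5_0 -a --report_gbits --use_cuda=0                 # server
ib_write_bw -d mlx5_0 -a --report_gbits --use_cuda=0 <server_ip>     # client
```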
I am running benchmarks with nccl-tests. I have 2 nodes connected via RoCE, and I have installed nv_peer_memory. However, once I turn on GPU Direct RDMA, the all_reduce_perf bandwidth gets dramatically worse than without GPU Direct RDMA. I am aware that the GPU PCIe topology matters, which is why I am only using GPU0 on both nodes, since GPU0 and the Mellanox HCA are attached to the same CPU. The GPU topology is:
Without GPU Direct RDMA (just plain RoCE), GPU0 on node 1 <-> GPU0 on node 2:
With GPU Direct RDMA over RoCE, GPU0 on node 1 <-> GPU0 on node 2:
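For reference, the two cases above can be toggled roughly like this when launching all_reduce_perf. Host names, paths, and message sizes below are placeholders, and NCCL_IB_CUDA_SUPPORT is, as far as I know, the pre-2.4 NCCL variable that forces or disables GPU Direct RDMA.

```bash
# Case 1: plain RoCE, staging through host memory (GDR off)
mpirun -np 2 -H node1,node2 \
  -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO -x NCCL_IB_CUDA_SUPPORT=0 \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

# Case 2: GPU Direct RDMA (GDR on, requires nv_peer_mem)
mpirun -np 2 -H node1,node2 \
  -x CUDA_VISIBLE_DEVICES=0 -x NCCL_DEBUG=INFO -x NCCL_IB_CUDA_SUPPORT=1 \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```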
According to the suggested system-support guidance, having a single CPU in the path between the GPU and the Mellanox HCA will yield worse performance. But I never expected it to be this much worse.
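One way to double-check the GPU/HCA placement is nvidia-smi's topology matrix, looking for PIX or PXB between GPU0 and the Mellanox device rather than PHB or SYS. The commands below are generic; no specific device names are assumed.

```bash
# Print the GPU <-> NIC PCIe topology matrix.
# PIX/PXB = same PCIe switch/bridge, PHB = same CPU root complex, SYS = crosses the inter-CPU link.
nvidia-smi topo -m

# Optionally, inspect the raw PCIe tree to see where the HCA and GPU0 sit.
lspci -tv
```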
At this point, I am wondering if there is any tool that can help debug nv_peer_mem to make sure it really takes effect? Or maybe there is something I misconfigured?
Here are the details of my environment: NVIDIA Tesla V100, CUDA 9.0, NCCL 2.2.13, OFED 4.2-1.2.0, Mellanox MT27710 ConnectX-4 Lx, nvidia_peer_memory 1.0-8.
I noticed that the log says 'No module present for GPU Direct RDMA'. When I check the module's status, this is what it looks like. Is this normal?
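For what it's worth, a few commands that can be used to sanity-check that nv_peer_mem is actually loaded and registered. This is only a sketch; the sysfs path in particular may differ between OFED/nv_peer_mem versions.

```bash
# Is the kernel module loaded?
lsmod | grep nv_peer_mem

# Is the packaged init service running (if installed from the nv_peer_mem package)?
service nv_peer_mem status

# When the module is active it registers as a peer-memory client;
# on many installs this sysfs node is present (path may vary by version):
cat /sys/kernel/mm/memory_peers/nv_mem/version
```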