Mellanox / nv_peer_memory


GPUDirect RDMA is not working inside the horovod-docker #41

Open vilmara opened 6 years ago

vilmara commented 6 years ago

hi all, I am running the TensorFlow benchmarks inside the horovod-docker to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server; I have also checked its status to make sure GPUDirect RDMA is active, and I noticed it is not recognized inside the horovod docker, see below:

Outside the docker:

```
$ service nv_peer_mem status
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
   Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
   Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
    Tasks: 0
   Memory: 0B
      CPU: 0

Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK
```

Inside the docker:

```
$ service nv_peer_mem status
nv_peer_mem: unrecognized service
```

Also, when I run the benchmarks inside the docker, the scaling efficiency drops from ~90% to ~77%, and the system reports this warning:

```
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket
```

Can you help me figure out how to fix this? Also, which mpirun flags enable RDMA (InfiniBand) and make sure the network communication goes over RDMA (InfiniBand) instead of the socket?
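For example, would something along these lines be the right direction? This is only a sketch: the host names, HCA name (mlx5_0) and interface (ib0) are from my setup, and I am not sure these are the correct flags for this combination of Open MPI and NCCL versions.

```bash
# Sketch only: host1/host2 are placeholders for the two servers.
# NCCL_DEBUG=INFO        -> print which transport NCCL actually selects
# NCCL_IB_DISABLE=0      -> allow the IB/verbs transport
# NCCL_IB_HCA=mlx5_0     -> restrict NCCL to this HCA
# NCCL_SOCKET_IFNAME=ib0 -> keep NCCL's sideband/socket traffic on the IB interface
mpirun -np 8 -H host1:4,host2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 -x NCCL_SOCKET_IFNAME=ib0 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca btl_tcp_if_include ib0 \
    python tf_cnn_benchmarks.py --variable_update=horovod --model=resnet50
```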

haggaie commented 6 years ago

I'm not sure about the efficiency inside docker, but regarding nv_peer_mem, it is only required to be loaded once, on the host. You don't need to load it inside a container too.
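One way to verify (just a sketch; containers share the host kernel, so the module loaded on the host is the one in effect, and the "unrecognized service" inside the container only means the init script is not installed in the image):

```bash
# On the host (or from inside the container, since the kernel is shared):
lsmod | grep nv_peer_mem
# equivalent check without the kmod tools:
cat /proc/modules | grep nv_peer_mem
```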

boriskovalev commented 6 years ago

@vilmara As Haggai commented, you need to load nv_peer_mem only on the host. Please make sure the GPUs and the Mellanox card are connected through a single PCIe switch (PIX) by running nvidia-smi topo -m. Do you use IB or RoCE? For IB you can use my community document https://community.mellanox.com/docs/DOC-3083 (without GPUDirect). I will add the GPUDirect part next week. For RoCE, please use the host network.
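For example, something like this to start the container with the host network and the IB devices exposed (a sketch only; the device paths are the usual MLNX_OFED ones and the image name is a placeholder):

```bash
# Sketch: give the container the host network stack and the host's IB device
# nodes. Paths below assume a standard MLNX_OFED install on the host;
# "horovod-image" is a placeholder for the rebuilt image.
docker run -it --runtime=nvidia \
    --network=host \
    --cap-add=IPC_LOCK \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    horovod-image bash
```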

vilmara commented 6 years ago

Hi @haggaie / @boriskovalev, thanks for your replies.

@boriskovalev, I am using IB; here is the output when running nvidia-smi topo -m:

```
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU1    NV2      X      NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU2    NV2     NV2      X      NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU3    NV2     NV2     NV2      X      SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
mlx5_0  SYS     SYS     SYS     SYS      X
```

I was able to modify the horovod Dockerfile and rebuild it with MLNX_OFED included. I have run the benchmarks again, but the system hangs, and the logs show both socket and InfiniBand connections:

Outputs:

```
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:97640:97815 [0] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:97640:97815 [0] INFO Using internal Network IB
c4140v1001:97640:97815 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
```

```
C4140-V100-2:375816:376026 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375816:376026 [1] INFO Using internal Network Socket
C4140-V100-2:375816:376026 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375817:376020 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375817:376020 [2] INFO Using internal Network Socket
C4140-V100-2:375817:376020 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375815:376019 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375815:376019 [0] INFO Using internal Network Socket
C4140-V100-2:375815:376019 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375818:376025 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375818:376025 [3] INFO Using internal Network Socket
C4140-V100-2:375818:376025 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
```
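The warnings come only from C4140-V100-2, while host 1 reports "Using internal Network IB", so I suspect the container on the second host still cannot see the user-space verbs library. This is the check I plan to run inside the container on each host (a sketch; ibv_devinfo comes with the OFED/rdma-core user-space tools and assumes they are present in the image):

```bash
# Inside the container on each host:
ldconfig -p | grep libibverbs   # is the verbs library visible to the dynamic loader?
ibv_devinfo                     # does mlx5_0 show up with state PORT_ACTIVE?
```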

Could you please share the commands you used in Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison of https://community.mellanox.com/docs/DOC-3083?