Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin

CUDA direct access between GPUs using PCI can't work with the plugin. #5

Closed: flymark2010 closed this issue 6 years ago

flymark2010 commented 6 years ago

Hi, have you tested the plugin with an Nvidia GPU? I found that when using NCCL to test GPU communication with the plugin, the test program hangs. Below are the details of my test:

Environment:
- OS: Ubuntu 16.04
- kubelet: 1.10.4
- NCCL version: 2.2

Test:
- test code: https://github.com/NVIDIA/nccl-tests
- test command: `NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128 -f 2 -g 2`

`-g 2` means the test thread uses 2 GPUs. With the environment variable `NCCL_DEBUG=INFO`, you can find lines like `INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer`, which means NCCL uses CUDA direct access between GPUs, over NVLink or PCI.
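For reference, a minimal sketch of how the test is built and run (the `CUDA_HOME`/`NCCL_HOME` paths here are assumptions and depend on your install):

```sh
# Build the NCCL tests (adjust CUDA_HOME / NCCL_HOME for your install)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr

# Run all-reduce on 2 GPUs with verbose NCCL logging
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128 -f 2 -g 2
```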

Result:

- When the test Pod is launched on a node without the `k8s-rdma-sriov-dev-plugin`, the test completes normally:

```
#                        out-of-place               in-place
#  bytes   N   type   op    time  algbw  busbw  res     time  algbw  busbw  res
caffe:34:34 [0] INFO Launch mode Group/CGMD
       8      2  float  sum    0.017   0.00   0.00  0e+00    0.017   0.00   0.00  0e+00
      16      4  float  sum    0.017   0.00   0.00  0e+00    0.017   0.00   0.00  0e+00
      32      8  float  sum    0.017   0.00   0.00  0e+00    0.017   0.00   0.00  0e+00
      64     16  float  sum    0.017   0.00   0.00  0e+00    0.017   0.00   0.00  0e+00
     128     32  float  sum    0.017   0.01   0.01  0e+00    0.017   0.01   0.01  0e+00
Out of bounds values : 0 OK
Avg bus bandwidth    : 0.00292487
```


- When the test Pod is launched on a node with the `k8s-rdma-sriov-dev-plugin`, the test program hangs after printing the log:

```
NCCL Tests compiled with NCCL 2.2
Using devices
  Rank 0 on caffe device 0 [0x05] GeForce GTX 1080 Ti
  Rank 1 on caffe device 1 [0x08] GeForce GTX 1080 Ti
...
caffe:8833:8833 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
caffe:8833:8833 [1] INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer

#                        out-of-place               in-place
#  bytes   N   type   op    time  algbw  busbw  res     time  algbw  busbw  res
caffe:8833:8833 [0] INFO Launch mode Group/CGMD
```

paravmellanox commented 6 years ago

@flymark2010 you should debug using

```
# kubectl describe pod <pod>
# kubectl exec -it <pod> -- bash
```

and then use gdb inside the pod to find where it is hanging. Since the program/pod has started, it is unlikely to be related to the device plugin.
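If it helps, a minimal sketch of that kind of inspection (the binary name `all_reduce_perf` and the availability of `pidof`/`gdb` inside the pod are assumptions about the test image):

```sh
# Open a shell in the hung test pod
kubectl exec -it <pod> -- bash

# Inside the pod: attach gdb to the hung test process and
# dump backtraces for every thread to see where it blocks
gdb -p "$(pidof all_reduce_perf)" -batch -ex "thread apply all bt"
```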
flymark2010 commented 6 years ago

The GPU p2p communication hang may be caused by the intel_iommu setting. With intel_iommu enabled, GPU p2p communication hangs; with it disabled, it does not. Nvidia's official test p2pBandwidthLatencyTest in the CUDA package can verify that.
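For example, a sketch of running that check (the samples path is an assumption; it varies by CUDA version, and this layout matches older toolkits):

```sh
# Build and run the P2P bandwidth/latency test from the CUDA samples
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
```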

We use RDMA SRIOV with Docker and disabled intel_iommu, and so far everything works fine (including GPU p2p communication and communication via the vhca). I think that may be one of the solutions.
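A sketch of how to check and disable it on Ubuntu with GRUB (the `sed` one-liner assumes `intel_iommu=on` already appears in `/etc/default/grub`; otherwise edit the `GRUB_CMDLINE_LINUX` line by hand):

```sh
# See whether intel_iommu is enabled on the running kernel
tr ' ' '\n' < /proc/cmdline | grep iommu

# Turn it off in the kernel command line, then reboot
sudo sed -i 's/intel_iommu=on/intel_iommu=off/' /etc/default/grub
sudo update-grub
sudo reboot
```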

paravmellanox commented 6 years ago

@flymark2010 Thanks for the details, it's helpful. Happy to hear that you are able to use GPU and RDMA SRIOV all together using this plugin.

davidstack commented 2 years ago

@flymark2010 When using SRIOV, can we disable intel_iommu? Is there any method? Thanks.