Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

RDMA_CM failure in sriov mode #22

Closed EdwardZhang88 closed 4 years ago

EdwardZhang88 commented 4 years ago

Based on the documentation, RDMA_CM should work in sriov mode. However, I am not able to run the ib_write_bw -R test, while the normal ib_write_bw test works. Below is the output I get when running ib_write_bw -R.

Container1:

    ib_write_bw -R

    ethtool -i eth0
    driver: mlx5_core
    version: 4.2-1.2.0
    ...

Container2 (on the same node as Container1):

    ib_write_bw -R 10.16.190.11    # IP address of eth0 in Container1
    rdma_resolve_route failed
    Unable to perform rdma_client function
    Unable to init the socket connection

What could be the cause?

moshe010 commented 4 years ago
  1. Does it work without RDMA_CM (i.e., if you use the -x and -d options)?
  2. What kernel are you using? Do you use Mellanox OFED? If so, what version?
  3. Can you ping between the 2 containers?
EdwardZhang88 commented 4 years ago

Thanks for the reply, @moshe010.

  1. Yes, it works fine without the -R option. Actually, the device reported is mlx5_9 if I don't specify -d, while kubelet assigned mlx5_2 and mlx5_3 to the containers. I even tried different device names and it still works. It seems there is a lack of isolation in terms of device visibility inside containers.
  2. The host kernel is 3.10.0-693.21.1 and the OFED version is 4.4-2.0.7. However, ethtool -i eth0 in the container reports version 4.2-1.2.0.
  3. Yes, the containers can ping each other in both directions.

Do you think sriov is set up correctly in my case?
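
To check which RDMA devices are actually visible inside a container, one option is to list the entries under /sys/class/infiniband and compare them with the devices kubelet assigned. The following is a minimal sketch, assuming a POSIX shell. `list_rdma_devices` is a hypothetical helper name, and its optional directory argument exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: print the RDMA device names found under a sysfs-style
# directory, one per line. Defaults to /sys/class/infiniband.
list_rdma_devices() {
    dir="${1:-/sys/class/infiniband}"
    # Fail if the directory is missing (for example, no RDMA stack loaded).
    [ -d "$dir" ] || return 1
    for dev in "$dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        if [ -e "$dev" ]; then
            basename "$dev"
        fi
    done
    return 0
}
```

Calling `list_rdma_devices` with no argument in each container and seeing names other than the assigned ones (such as mlx5_9 here) would be consistent with the visibility observation in point 1.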

moshe010 commented 4 years ago

Regarding point 2: how come you have a different OFED version in the container than on the host? Can you please make sure that both are built with the same OFED driver? (In the container you only need to install the userspace packages, not the kernel packages; I think there is a --userspace flag when you install OFED.) Some more questions:

  1. Does rping work between the 2 containers?
  2. If you run ib_write_bw -R & (in the background) and then ib_write_bw -R 127.0.0.1 in the same container, does it work?
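
The host-versus-container version comparison can be sketched as a small shell helper. This assumes that `ofed_info -s` is available on both sides and prints a string of the form `MLNX_OFED_LINUX-<version>:`; that output format is an assumption here, and `ofed_version` and `same_ofed` are hypothetical helper names.

```shell
# Hypothetical helper: reduce a string such as "MLNX_OFED_LINUX-4.4-2.0.7.0:"
# (the assumed output format of `ofed_info -s`) to the bare version number.
ofed_version() {
    printf '%s\n' "$1" | sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//'
}

# Hypothetical helper: succeed only if both strings carry the same,
# non-empty version.
same_ofed() {
    a=$(ofed_version "$1")
    b=$(ofed_version "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

One way to use it is to capture the output of `ofed_info -s` on the host and inside the container, then pass both strings to `same_ofed` and check its exit status.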
EdwardZhang88 commented 4 years ago

I know what the problem is now. I simply forgot to reboot the server after upgrading OFED from 4.2 to 4.4. Now that OFED 4.4 has taken effect, both rping and ib_write_bw -R work fine in the containers. Thanks for the help.
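
A pending reboot of this kind can usually be spotted by comparing the version of the driver module that is currently loaded with the version of the module file installed on disk. The following is a minimal sketch, assuming the module exposes its version under /sys/module and that `modinfo` is available. `same_version` and `check_driver` are hypothetical helper names.

```shell
# Hypothetical helper: succeed when both version strings are non-empty
# and identical.
same_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Hypothetical helper: compare the loaded and installed versions of a
# kernel module (mlx5_core by default) and print the result.
check_driver() {
    mod="${1:-mlx5_core}"
    # Version of the module currently loaded in the running kernel.
    loaded=$(cat "/sys/module/$mod/version" 2>/dev/null)
    # Version of the module file installed on disk.
    installed=$(modinfo -F version "$mod" 2>/dev/null)
    if same_version "$loaded" "$installed"; then
        echo "$mod: loaded version $loaded matches the installed version"
    else
        echo "$mod: loaded='$loaded' installed='$installed'; a reboot or driver restart may be pending"
    fi
}
```

Running `check_driver mlx5_core` on the host after an upgrade is one way to use it. In this thread, ethtool -i eth0 still reported 4.2-1.2.0 after the 4.4 upgrade, which is the kind of mismatch this check is meant to surface.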