Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

Cannot use rdma_client when using hca mode with Calico #18

Closed asdfsx closed 5 years ago

asdfsx commented 5 years ago

To use hca mode with Calico, I added these settings:

            - name: IP_AUTODETECTION_METHOD
              value: "interface=enp175s0"
            - name: IP6_AUTODETECTION_METHOD
              value: "interface=enp175s0"

After creating the whole network, I tried to run a connectivity test with rdma_server/rdma_client, so I first created two pods:

apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
spec:  # specification of the pod's contents
  containers:
  - name: iperf-server
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    stdin: true
    tty: true
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-client-1
spec:  # specification of the pod's contents
  containers:
  - name: iperf-client-1
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    stdin: true
    tty: true

The image asdfsx/mofed_benchmark is built using this Dockerfile.
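
The RDMA tools used below come from the standard user-space packages; a rough sketch of what such an image needs (Ubuntu package names; the actual Dockerfile presumably installs MOFED, which ships the same binaries):

# librdmacm-utils : rdma_server, rdma_client, rping
# perftest        : ib_write_bw, ib_read_bw
# ibverbs-utils   : ibv_devices, ibv_devinfo
apt-get update && apt-get install -y librdmacm-utils perftest ibverbs-utils iproute2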

Then I started rdma_server:

$ kubectl exec -it iperf-server -- rdma_server
rdma_server: start

Starting rdma_client then fails with an error:

$ kubectl exec -it iperf-client-1 -- rdma_client -s 10.244.0.8
rdma_client: start
rdma_create_ep: No such device
rdma_client: end -1
command terminated with exit code 255

I want to know why this happens. I'm totally confused.

asdfsx commented 5 years ago

@paravmellanox We have tried HCA mode with a RoCE adapter (so we do not configure IPoIB), and we found that the QPs in the container cannot establish an RDMA connection:

[root@iperf-client-1 tmp]# ib_write_bw -d mlx5_0 &
[1] 208
[root@iperf-client-1 tmp]# 
************************************
* Waiting for client to connect... *
************************************

[root@iperf-client-1 tmp]# ib_write_bw -d mlx5_0 localhost &
[2] 209
[root@iperf-client-1 tmp]# 
[root@iperf-client-1 tmp]# ---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x011b PSN 0x46059f RKey 0x0036a2 VAddr 0x007f1542c3c000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x011c PSN 0x88e426 RKey 0x00bab8 VAddr 0x007f33a21fb000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
 remote address: LID 0000 QPN 0x011b PSN 0x46059f RKey 0x0036a2 VAddr 0x007f1542c3c000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
Failed to modify QP 284 to RTR
 Unable to Connect the HCA's through the link
 remote address: LID 0000 QPN 0x011c PSN 0x88e426 RKey 0x00bab8 VAddr 0x007f33a21fb000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
Failed to modify QP 283 to RTR
 Unable to Connect the HCA's through the link

Is this configuration supported?

paravmellanox commented 5 years ago

@asdfsx, which MOFED version did you use on the host, and what type of virtual netdevices does Calico configure? Currently macvlan netdevices are supported by the kernel stack. If you use a vxlan overlay or something similar, it is not supported. Please check.

asdfsx commented 5 years ago

@paravmellanox Thanks for the reply. The MOFED version is MLNX_OFED_LINUX-4.4-2.0.7.0-ubuntu16.04-x86_64. How do I check the type of virtual netdevice configured by Calico? Do you mean the type of the interface "enp175s0" in my case, or the type of the interface created by Calico, "cali25673b519fb"?

paravmellanox commented 5 years ago

@asdfsx In your case it's cali25673b519fb. You can run ethtool -i cali25673b519fb; it will show the driver that owns the interface. You can look for a macvlan CNI plugin for Kubernetes, which will likely work, though I haven't tried it.

asdfsx commented 5 years ago

The result of ethtool -i cali25673b519fb:

driver: veth
version: 1.0
firmware-version: 
expansion-rom-version: 
bus-info: 
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

So the veth type isn't supported right now?

paravmellanox commented 5 years ago

@asdfsx, correct, veth netdevices are unsupported.
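
rdma_create_ep resolves the destination IP through the RDMA CM, and inside the pod the route to 10.244.0.8 goes over the Calico veth (eth0), which has no RDMA device behind it; that is why you see "No such device" above. A quick way to check which netdevice would carry the traffic (using the pod names from this issue):

# on Calico this resolves to the eth0 veth inside the pod, not the enp175s0 uplink
$ kubectl exec -it iperf-client-1 -- ip route get 10.244.0.8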

asdfsx commented 5 years ago

The interface cali25673b519fb is created automatically by the Calico CNI, and the interface enp175s0 is a Mellanox converged network adapter:

$ lspci -v|grep Mell -A 5
af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    Subsystem: Mellanox Technologies MT28800 Family [ConnectX-5]
    Flags: bus master, fast devsel, latency 0, IRQ 41
    Memory at dc000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at ee600000 [disabled] [size=1M]
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [48] Vital Product Data
$ dmesg|grep af:00.0
...
[   12.997251] mlx5_core 0000:af:00.0 enp175s0: renamed from eth0
...

By the way, in my Calico config I set IP-in-IP to Always. Could that setting have any effect here?

paravmellanox commented 5 years ago

@asdfsx I understand. Instead of Calico, you should use a macvlan CNI, where the virtual devices are children of enp175s0. RoCE can make use of those netdevices.

asdfsx commented 5 years ago

@paravmellanox macvlan works perfectly with hca mode. Unfortunately, macvlan does not play well with iptables, and Kubernetes Services depend heavily on iptables, so generally we don't use macvlan.

paravmellanox commented 5 years ago

@asdfsx Good to know that macvlan worked for you. Which plugin did you use? Sure, you can keep iptables with the default Kubernetes-managed veth, and use this additional plugin for RDMA. Other users are using the Multus plugin, which allows you to have multiple netdev interfaces in a Pod: the first is the default veth interface managed by your existing plugin, and the second is a macvlan or SR-IOV interface via a second CNI. This way you get the best of both worlds, performance and functionality.
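
A rough sketch of that layout, assuming the Multus NetworkAttachmentDefinition CRD and the reference macvlan CNI (the attachment name rdma-macvlan and the host-local subnet are placeholders):

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-macvlan
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "enp175s0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24"
      }
    }'
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
  annotations:
    # attach a second, macvlan-backed interface on top of the default Calico veth
    k8s.v1.cni.cncf.io/networks: rdma-macvlan
spec:
  containers:
  - name: iperf-server
    image: "asdfsx/mofed_benchmark"
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/hca: 1
    command: ["/bin/bash", "-c", "sleep 2000000000000"]

With this layout eth0 remains the Calico veth, so Services and iptables keep working, while the macvlan interface (typically net1) carries the RoCE traffic; rdma_client and ib_write_bw should then target the net1 address instead of the Calico pod IP.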

asdfsx commented 5 years ago

We use Multus right now, just like you said. Thanks for your help. I'll close this issue!

tingweiwu commented 5 years ago

@asdfsx could you tell me how to use Multus?

asdfsx commented 5 years ago

@tingweiwu Just follow Multus's examples. It's not that hard.

goversion commented 3 years ago

@asdfsx @tingweiwu @paravmellanox @yshestakov

I have encountered the same problem. How did you solve it?

The error is: Failed to modify QP 283 to RTR / Unable to Connect the HCA's through the link

My Kubernetes environment has Multus and Calico installed, and both plugins are running normally, but I still get the above error. Installing Multus by itself doesn't seem to help. What is the purpose of installing Multus in hca mode?