Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

How to configure the ConfigMap #1

Closed. flymark2010 closed this issue 6 years ago.

flymark2010 commented 6 years ago

Hi, thanks for your work! I've been working with this plugin recently and I'm confused about the node ConfigMap. I have 7 nodes in my cluster, each with a Mellanox ConnectX-4 Lx device and 7 SR-IOV VFs. The systems are all Ubuntu 16.04. Running ifconfig, I get this:

ens5f0    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1c:8c  
          inet addr:10.128.1.5  Bcast:10.128.1.255  Mask:255.255.255.0
          inet6 addr: fe80::526b:4bff:fe2f:1c8c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3438707 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2557173 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3391158743 (3.3 GB)  TX bytes:2311580366 (2.3 GB)

ens5f1    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1c:8d  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ens5f2    Link encap:Ethernet  HWaddr b6:96:dc:55:e9:df  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18515 errors:0 dropped:0 overruns:0 frame:0
          TX packets:487 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2850062 (2.8 MB)  TX bytes:83849 (83.8 KB)

ens5f3    Link encap:Ethernet  HWaddr 72:96:21:22:9e:ed  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19140 errors:0 dropped:0 overruns:0 frame:0
          TX packets:522 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2953775 (2.9 MB)  TX bytes:88006 (88.0 KB)

ens5f4    Link encap:Ethernet  HWaddr 3e:f9:0f:af:df:9e  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18753 errors:0 dropped:0 overruns:0 frame:0
          TX packets:497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2890892 (2.8 MB)  TX bytes:83641 (83.6 KB)

ens5f5    Link encap:Ethernet  HWaddr da:2a:71:b3:9e:19
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19218 errors:0 dropped:0 overruns:0 frame:0
          TX packets:530 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2969260 (2.9 MB)  TX bytes:88424 (88.4 KB)

ens5f6    Link encap:Ethernet  HWaddr 4e:eb:0e:d5:bb:05  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18793 errors:0 dropped:0 overruns:0 frame:0
          TX packets:512 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2897280 (2.8 MB)  TX bytes:86343 (86.3 KB)

ens5f7    Link encap:Ethernet  HWaddr 6e:39:97:e5:bc:4e  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19338 errors:0 dropped:0 overruns:0 frame:0
          TX packets:523 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2990496 (2.9 MB)  TX bytes:89079 (89.0 KB)

So, how should I configure the node ConfigMap? I have tried this:

  config.json: |
    {
        "pfNetdevices": [
                "ens5f0",
                "ens5f1",
                "ens5f2",
                "ens5f3",
                "ens5f4",
                "ens5f5",
                "ens5f6",
                "ens5f7",
        ]
    }

But when I started the device plugin DaemonSet, there was no rdma/vhca resource in the node description, and when I tried to start test-pod.yaml, the Pod stayed in Pending status because no node had sufficient resources to deploy it.
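For context on the scheduling failure: the plugin's rdma/vhca resource is consumed like any Kubernetes extended resource, via resources.limits in the pod spec. Below is a rough sketch of such a pod; the image name is a placeholder, and this is not necessarily the exact content of the repo's test-pod.yaml. If no node advertises rdma/vhca as allocatable, a pod with such a limit stays Pending, which matches the behaviour above.

apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
spec:
  containers:
  - name: test
    image: <rdma-capable-image>   # placeholder image
    resources:
      limits:
        rdma/vhca: 1              # one virtual HCA (VF) from the device plugin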

paravmellanox commented 6 years ago

You only need to add the PF netdevice information here. The device plugin automatically detects each PF's child VFs and uses them. Please remove the child VF netdevices from the list.
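The "automatically detects child VFs" part corresponds to the PF's sysfs layout: each VF appears as a virtfnN link under the PF's device directory, the same virtfn0, virtfn1, ... names that show up in the plugin log later in this thread. A quick way to list them by hand, assuming ens5f0 is the PF:

ls -l /sys/class/net/ens5f0/device/ | grep virtfn   # one virtfnN symlink per child VF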

paravmellanox commented 6 years ago

Also, please don't enable SR-IOV yourself. This device plugin enables SR-IOV and does the necessary VF configuration for InfiniBand and RoCE, depending on whether you are on an upstream kernel or MOFED.
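My reading of "don't enable sriov by yourself": leave the kernel-level SR-IOV switch to the plugin; only the firmware-level enable on the HCA remains a user step. The switch in question is the standard sysfs knob below, which the plugin drives itself:

cat /sys/class/net/ens5f0/device/sriov_totalvfs        # how many VFs the PF supports
# echo 8 > /sys/class/net/ens5f0/device/sriov_numvfs   # the manual enable step to skip; the plugin does the equivalent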

paravmellanox commented 6 years ago

@flymark2010 I have updated the documentation for this. Let me know how it goes with only PFs in the list.

flymark2010 commented 6 years ago

Sorry for not replying for so long. We were waiting for the new OFED driver; we have now installed OFED 4.4 and tried again, but it still fails.

First, I'm not sure what "don't enable sriov by yourself" means. I used the command mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=9 and then rebooted the system. Then I got the HCA info with ibv_devinfo:

hca_id: mlx5_1
    transport:          InfiniBand (0)
    fw_ver:             14.23.1000
    node_guid:          506b:4b03:002f:1a3d
    sys_image_guid:         506b:4b03:002f:1a3c
    vendor_id:          0x02c9
    vendor_part_id:         4117
    hw_ver:             0x0
    board_id:           MT_2420110034
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             14.23.1000
    node_guid:          506b:4b03:002f:1a3c
    sys_image_guid:         506b:4b03:002f:1a3c
    vendor_id:          0x02c9
    vendor_part_id:         4117
    hw_ver:             0x0
    board_id:           MT_2420110034
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

and the Ethernet interface info from ifconfig:

ens5f0    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:3c  
          inet addr:10.128.1.16  Bcast:10.128.1.255  Mask:255.255.255.0
          inet6 addr: fe80::526b:4bff:fe2f:1a3c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34153 errors:0 dropped:231 overruns:0 frame:0
          TX packets:8405 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7302014 (7.3 MB)  TX bytes:7965863 (7.9 MB)

ens5f1    Link encap:Ethernet  HWaddr 50:6b:4b:2f:1a:3d  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The card in this node has two ports, but only one is in use: HCA mlx5_0 and Ethernet interface ens5f0 are the active ones.
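To double-check that the firmware-level setting from the mlxconfig command above actually took effect after the reboot, something along these lines should work (same device path as above; mst start is only needed if the MST devices are not already loaded):

mst start                                                            # load MST devices (MOFED tools)
mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep -E 'SRIOV_EN|NUM_OF_VFS'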

The content of rdma-sriov-node-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
        "mode" : "sriov",
        "pfNetdevices": [ "ens5f0" ]
    }
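Applied in the usual way before starting the DaemonSet, e.g.:

kubectl apply -f rdma-sriov-node-config.yaml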

Then I created the device plugin DaemonSet, and all of its Pods run normally with status Running. I can now see 9 virtual HCA devices, and every port state is PORT_ACTIVE; the same goes for the corresponding Ethernet interfaces.

But there is still no rdma/vhca resource in the node description, and the test Pod stays in Pending state with the message Warning FailedScheduling 50s (x91 over 25m) default-scheduler 0/7 nodes are available: 7 Insufficient rdma/vhca.
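For reference, two ways to check those two observations (the node name is a placeholder):

ls /sys/class/infiniband/                              # one mlx5_N entry per PF/VF HCA on the host
kubectl describe node <node-name> | grep -i rdma       # rdma/vhca should appear under Capacity/Allocatable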

paravmellanox commented 6 years ago

np @flymark2010. I will make the documentation more crisp instead of saying "don't enable sriov by yourself". Basically, the rdma device plugin enables SR-IOV and does the necessary rdma configuration, so the user should not enable it by writing to the sysfs files. What you have done to enable it at the HCA (firmware/hardware) level is correct.

Can you please share the output of

ip link show ens5f0

and

kubectl logs --namespace=kube-system <pod_of_device_plugin_ds>

This will help debug why the vhca resources are not being published, or whether something else went wrong.

flymark2010 commented 6 years ago

Output of ip link show ens5f0:

4: ens5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 50:6b:4b:2f:1a:3c brd ff:ff:ff:ff:ff:ff
    vf 0 MAC fa:8e:ae:13:f6:8f, spoof checking off, link-state auto
    vf 1 MAC da:46:61:b7:b2:2f, spoof checking off, link-state auto
    vf 2 MAC 6a:f6:3c:f9:75:69, spoof checking off, link-state auto
    vf 3 MAC ee:f9:5e:0a:c9:1e, spoof checking off, link-state auto
    vf 4 MAC fe:8c:fe:4a:af:bb, spoof checking off, link-state auto
    vf 5 MAC 9a:4c:c5:74:7f:75, spoof checking off, link-state auto
    vf 6 MAC a2:8a:40:ee:a1:89, spoof checking off, link-state auto
    vf 7 MAC 0e:d1:77:26:c3:68, spoof checking off, link-state auto
    vf 8 MAC 72:ff:98:e1:54:9c, spoof checking off, link-state auto

The device plugin log keeps repeating the following:

2018/07/11 05:45:51 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/07/11 05:45:51 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
2018/07/11 05:45:51 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/07/11 05:45:51 sriov device mode
Configuring SRIOV on ndev= ens5f0 6
max_vfs =  9
cur_vfs =  9
vf = &{0 virtfn0 true false}
vf = &{1 virtfn1 true false}
vf = &{2 virtfn2 true false}
vf = &{3 virtfn3 true false}
vf = &{4 virtfn4 true false}
vf = &{5 virtfn5 true false}
vf = &{6 virtfn6 true false}
vf = &{7 virtfn7 true false}
vf = &{8 virtfn8 true false}

I'm sure the device plugin feature gate is set for k8s; here is the ps result:

# ps -ef | grep kubelet
root      2082     1  5 11:10 ?        00:08:09 /usr/local/kubernetes/kubelet --address=10.128.1.16 --hostname-override=10.128.1.16 --pod-infra-container-image=10.128.2.6/kube-system/pause-amd64:3.0 --experimental-bootstrap-kubeconfig=/etc/kubernetes/bootstrap.kubeconfig --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --cert-dir=/etc/kubernetes/ssl --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/usr/local/kubernetes --cluster-dns=10.0.0.2 --cluster-domain=cluster.cloudwalk. --hairpin-mode hairpin-veth --feature-gates=DevicePlugins=true --allow-privileged=true --fail-swap-on=false --logtostderr=true --v=2
root     16674 10575  0 13:46 pts/2    00:00:00 grep --color=auto kubelet

paravmellanox commented 6 years ago

@flymark2010 the plugin seems to configure the VFs correctly, and the feature gate is enabled. The "Could not register device plugin: ... unknown service v1beta1.Registration" error suggests the kubelet is too old to offer the v1beta1 device-plugin API. What kubeadm and kubelet versions are you using? 1.10.3 or higher should work.

flymark2010 commented 6 years ago

The kubelet version is 1.9.0. I'll try a higher kubelet version.
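For anyone hitting the same thing, a quick way to confirm what each node is actually running:

kubelet --version        # version of the local kubelet binary
kubectl get nodes        # the VERSION column shows the kubelet version reported by each node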

flymark2010 commented 6 years ago

After upgrading the kubelet version to 1.10.4, I can see the resource rdma/vhca in the node description, and the test Pod can run normally.
Thanks a lot!