Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

Daemonset logs say `Link not found` #14

Closed: harryge00 closed this issue 5 years ago

harryge00 commented 5 years ago

I have followed the README to create the ConfigMap and DaemonSet (a sketch of the ConfigMap is included after the logs below), but the device plugin does not seem to work correctly:

# kubectl logs xxxxx -n kube-system
2018/10/17 08:13:49 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/10/17 08:13:49 loaded config:  {"mode":"sriov","pfNetdevices":["enp97s0f0","enp97s0f1","enp97s0f2","enp97s0f3"]}
2018/10/17 08:13:49 sriov device mode
Configuring SRIOV on ndev= enp97s0f0 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f0
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f1 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f1
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f2 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f2
Fail to configure sriov; error =  Link not found
Configuring SRIOV on ndev= enp97s0f3 9
max_vfs =  32
cur_vfs =  0
vf = &{10 virtfn10 false false}
Fail to config vfs for ndev = enp97s0f3
Fail to configure sriov; error =  Link not found
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-3+18f677ae060806", GitCommit:"18f677ae0608064799c7e7f2bc2732d37f22efe3", GitTreeState:"clean", BuildDate:"2018-10-16T12:37:20Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"linux/amd64"}
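
For reference, the ConfigMap from the README was created roughly along these lines (a sketch: the ConfigMap name and namespace are my assumptions from the README, while the config.json data matches the loaded config in the log above):

# Hypothetical ConfigMap behind the config.json the plugin reads; adjust the
# name/namespace to whatever your README-based manifest uses.
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "mode": "sriov",
      "pfNetdevices": ["enp97s0f0", "enp97s0f1", "enp97s0f2", "enp97s0f3"]
    }
EOF
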
songole commented 5 years ago

I am also seeing this issue. At times it detects all the configured VFs and at other times it fails with the same error.

moshe010 commented 5 years ago

Hi, I think we were able to reproduce the issue. We think it is related to the fact that one or more VFs are in other network namespaces before you start/restart the device plugin. Can you check how many VF network interfaces you see on the host? Does it match the max num_vfs?
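
For example, something along these lines (a sketch using the standard kernel sysfs paths; replace ens3f1 with your PF netdev):

# Supported maximum vs. currently configured VFs on the PF:
cat /sys/class/net/ens3f1/device/sriov_totalvfs
cat /sys/class/net/ens3f1/device/sriov_numvfs
# VF netdevs that are still visible from the host network namespace:
ls -d /sys/class/net/ens3f1/device/virtfn*/net/* 2>/dev/null | wc -l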

songole commented 5 years ago

I don't see all 32 (max) VFs. The following is the output of ibdev2netdev on the host:

$ ibdev2netdev
mlx5_0 port 1 ==> ens3f0 (Down)
mlx5_1 port 1 ==> ens3f1 (Up)
mlx5_10 port 1 ==> enp2s5f2 (Down)
mlx5_11 port 1 ==> enp2s5f3 (Down)
mlx5_12 port 1 ==> enp2s5f4 (Down)
mlx5_13 port 1 ==> enp2s5f5 (Down)
mlx5_14 port 1 ==> enp2s5f6 (Down)
mlx5_15 port 1 ==> enp2s5f7 (Down)
mlx5_16 port 1 ==> enp2s6 (Down)
mlx5_17 port 1 ==> enp2s6f1 (Down)
mlx5_18 port 1 ==> enp2s6f2 (Down)
mlx5_19 port 1 ==> enp2s6f3 (Down)
mlx5_20 port 1 ==> enp2s6f4 (Down)
mlx5_21 port 1 ==> enp2s6f5 (Down)
mlx5_22 port 1 ==> enp2s6f6 (Down)
mlx5_23 port 1 ==> enp2s6f7 (Down)
mlx5_24 port 1 ==> enp2s7 (Down)
mlx5_25 port 1 ==> enp2s7f1 (Down)
mlx5_26 port 1 ==> enp2s7f2 (Down)
mlx5_27 port 1 ==> enp2s7f3 (Down)
mlx5_28 port 1 ==> enp2s7f4 (Down)
mlx5_29 port 1 ==> enp2s7f5 (Down)
mlx5_30 port 1 ==> enp2s7f6 (Down)
mlx5_31 port 1 ==> enp2s7f7 (Down)
mlx5_32 port 1 ==> enp2s8 (Down)
mlx5_33 port 1 ==> enp2s8f1 (Down)

I have 2 nodes connected back to back on ens3f1.

And following is the log of the RDMA device plugin:

$ kubectl logs -n kube-system rdma-sriov-dp-ds-h9fmb
2018/10/18 17:46:02 Starting K8s RDMA SRIOV Device Plugin version= 0.2
2018/10/18 17:46:02 Starting FS watcher.
2018/10/18 17:46:02 Starting OS watcher.
2018/10/18 17:46:02 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/10/18 17:46:02 loaded config: {"mode":"sriov","pfNetdevices":["ens3f1"]}
2018/10/18 17:46:02 sriov device mode
Configuring SRIOV on ndev= ens3f1 6
max_vfs = 32
cur_vfs = 32
vf = &{26 virtfn26 true false}
vf = &{7 virtfn7 false false}
Fail to config vfs for ndev = ens3f1
Fail to configure sriov; error = Link not found
2018/10/18 17:46:02 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/10/18 17:46:02 Registered device plugin with Kubelet exposing devices: []

moshe010 commented 5 years ago

So are you using sriov-cni as well? Do you have some containers that are using VFs on that node?

songole commented 5 years ago

Yes for sriov-cni.

It worked for a few pods initially, which got SR-IOV VFs. I wasn't able to launch more; I was getting an admission error. So I cleaned up all the pods, rebooted, and restarted the device plugin.

Following is the output of kubectl get pods. The IPs in 10.55.xx.xx are given out by sriov-cni.

dkube@dkube-217:~$ kubectl get po -n kube-system -o wide
NAME                                   READY   STATUS             RESTARTS   AGE   IP               NODE
etcd-dkube-217                         1/1     Running            17         15d   192.168.50.217   dkube-217
kube-apiserver-dkube-217               1/1     Running            18         15d   192.168.50.217   dkube-217
kube-controller-manager-dkube-217      1/1     Running            27         15d   192.168.50.217   dkube-217
kube-dns-86f4d74b45-tnbl8              1/3     CrashLoopBackOff   3988       9d    10.55.206.47     dkube-217
kube-proxy-ttcjv                       1/1     Running            12         15d   192.168.50.228   smicro-228
kube-proxy-xk7z2                       1/1     Running            17         15d   192.168.50.217   dkube-217
kube-scheduler-dkube-217               1/1     Running            18         15d   192.168.50.217   dkube-217
kube-sriov-cni-ds-installer-ktqp7      1/1     Running            4          1d    192.168.50.228   smicro-228
kube-sriov-cni-ds-installer-lhwqr      1/1     Running            5          1d    192.168.50.217   dkube-217
nvidia-device-plugin-daemonset-7cr2x   1/1     Running            8          10d   10.55.206.25     smicro-228
nvidia-device-plugin-daemonset-x6wb6   1/1     Running            12         10d   10.55.206.52     dkube-217
rdma-sriov-dp-ds-h9fmb                 1/1     Running            0          3m    192.168.50.217   dkube-217
rdma-sriov-dp-ds-xjjn7                 1/1     Running            0          3m    192.168.50.228   smicro-228

moshe010 commented 5 years ago

OK, so some VFs moved to another namespace and you restarted the device plugin. That is the same issue we saw as well. I will look into the code next week and provide a fix for that.
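
To confirm that is what happened on your node, a quick check (a sketch; virtfn* is the standard SR-IOV sysfs layout, replace ens3f1 with your PF):

# Each enabled VF has a virtfnN link under the PF's PCI device:
ls -d /sys/class/net/ens3f1/device/virtfn*
# A VF whose netdev is still in the host namespace has a net/<ifname> entry here;
# VFs that were moved into a pod's network namespace show nothing, and those are
# the ones the plugin trips over when it restarts:
ls -d /sys/class/net/ens3f1/device/virtfn*/net/* 2>/dev/null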

songole commented 5 years ago

Thanks. Could I expect a fix by early next week? We are blocked on RDMA experiments.

harryge00 commented 5 years ago

After deleting the pod, the new pod works fine.

harryge00 commented 5 years ago
# ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)

I don't know how to list VF devices...

[root@gpu28 ~]# find /sys/devices/ -name sriov_totalvfs
/sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.0/sriov_totalvfs
/sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.1/sriov_totalvfs
/sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.2/sriov_totalvfs
/sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.3/sriov_totalvfs
[root@gpu28 ~]# cat /sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.0/sriov_totalvfs
32

So I think the total VF number is 32 * 4 = 128?
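
From what I can tell, sriov_totalvfs is only the supported maximum per PF, so 32 * 4 = 128 would be the ceiling across the four ports; how many VFs actually exist is in sriov_numvfs next to it, and the enabled VFs show up as PCI virtual functions (a sketch, reusing the PCI path from the find output above):

# Currently enabled VFs on the first PF (vs. the 32 supported):
cat /sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.0/sriov_numvfs
# The enabled VFs appear as virtfnN links under the same PCI device:
ls -d /sys/devices/pci0000:5d/0000:5d:02.0/0000:5f:00.0/0000:60:03.0/0000:61:00.0/virtfn*
# Or list all Mellanox (vendor 15b3) virtual functions via lspci:
lspci -d 15b3: | grep -i "virtual function"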

mak-454 commented 5 years ago

> After deleting the pod, the new pod works fine.

I deleted the running pods and the device plugin DaemonSet, and I am still not able to recover. Is there a way to recover from this situation? Please suggest.

moshe010 commented 5 years ago

You can try to reboot the server or reload the mlnx driver.
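
For the driver reload, roughly (a sketch; the exact steps depend on whether you run MLNX_OFED or the inbox mlx5 driver):

# With MLNX_OFED installed:
/etc/init.d/openibd restart
# Or, with the inbox driver, reload the mlx5 modules:
modprobe -r mlx5_ib mlx5_core
modprobe mlx5_core
modprobe mlx5_ib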

songole commented 5 years ago

Rebooting didn't help.

moshe010 commented 5 years ago

Do you see any errors in dmesg? Can you try to decrease NUM_OF_VFS to 16 with mft?
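
With the MFT tools that would be roughly (a sketch; the mst device name below is only an example, check the mst status output for yours):

mst start
mst status                                        # find your PCI conf device
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep NUM_OF_VFS
mlxconfig -d /dev/mst/mt4119_pciconf0 set NUM_OF_VFS=16
# The new value only takes effect after a firmware reset or a reboot:
mlxfwreset -d /dev/mst/mt4119_pciconf0 reset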

songole commented 5 years ago

I can try that. However, this requires enabling SR-IOV for the entire cluster, and I need more than 16 VFs to bring up my application. It would be an academic exercise for me unless I can enable Multus CNI. I am waiting for documentation on that, as we ran into issues integrating with Multus.

songole commented 5 years ago

Attached are the dmesg logs for mlx: mlx_logs.txt

songole commented 5 years ago

Reduced the VFs to 16. Problem persists.

2018/10/22 22:34:50 sriov device mode
Configuring SRIOV on ndev= ens3f1 6
max_vfs = 16
cur_vfs = 16
vf = &{7 virtfn7 true false}
vf = &{5 virtfn5 true false}
vf = &{14 virtfn14 true false}
vf = &{3 virtfn3 true false}
vf = &{12 virtfn12 true false}
vf = &{1 virtfn1 true false}
vf = &{10 virtfn10 true false}
vf = &{8 virtfn8 true false}
vf = &{6 virtfn6 true false}
vf = &{15 virtfn15 true false}
vf = &{4 virtfn4 false false}
Fail to config vfs for ndev = ens3f1
Fail to configure sriov; error = Link not found

songole commented 5 years ago

Do you have a fix in the works?

harryge00 commented 5 years ago

@songole Maybe you can check whether the VFs are used by other processes?

songole commented 5 years ago

@harryge00 No. I am running kubernetes on this node and nothing else.

mak-454 commented 5 years ago

@harryge00 After reinstalling the Kubernetes cluster, things started working. Deleting the jobs and rebooting the nodes didn't help; I had to reinstall Kubernetes to recover.

paravmellanox commented 5 years ago

@songole We should have a fix soon. I am facing some issues with accessing the repo on GitHub.

paravmellanox commented 5 years ago

@harryge00 @songole @moshe010 @mak-454 A new fix is available on Docker Hub now: https://hub.docker.com/r/rdma/k8s-rdma-sriov-dev-plugin

Please delete the old container image and pull the new one for the device plugin. Sometimes I have seen that K8s is not able to pick up the new image and fails repeatedly; in that case the only option I found was to reconfigure, but this may be fixed in the latest K8s.
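
To force the nodes onto the new image, something like this usually works (a sketch; the image name is from the Docker Hub link above and the pod name prefix is taken from the get-pods output earlier in this thread):

# On each node, pull the updated plugin image:
docker pull rdma/k8s-rdma-sriov-dev-plugin
# Then delete the running plugin pods so the DaemonSet recreates them from the new image:
kubectl -n kube-system get pods -o name | grep rdma-sriov-dp-ds | xargs kubectl -n kube-system delete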

moshe010 commented 5 years ago

@harryge00 or @songole, can you close the issue?

songole commented 5 years ago

We don't see the problem anymore with the fix. @moshe010 I don't think I have the permissions to close it.