Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

RDMA SR-IOV device plugin returns "device or resource busy" #12

Closed dahsing closed 5 years ago

dahsing commented 5 years ago

Hi, I've run into a problem. I have several nodes deployed with the RDMA SR-IOV device plugin and the SR-IOV CNI. When I run the RDMA SR-IOV device plugin, it returns an error. The log is below:

[root@localhost bin]# kubectl logs rdma-sriov-dp-ds-mzv9m  -n kube-system
2018/09/18 03:21:57 Starting K8s RDMA SRIOV Device Plugin version= 0.2
2018/09/18 03:21:57 Starting FS watcher.
2018/09/18 03:21:57 Starting OS watcher.
2018/09/18 03:21:57 Reading /k8s-rdma-sriov-dev-plugin/config.json
2018/09/18 03:21:57 loaded config:  {"mode":"sriov","pfNetdevices":["ens5f0"]}
2018/09/18 03:21:57 sriov device mode
Configuring SRIOV on ndev= ens5f0 6
max_vfs =  8
cur_vfs =  0
Fail to enable sriov for netdev = ens5f0
Fail to configure sriov; error =  write /sys/class/net/ens5f0/device/sriov_numvfs: device or resource busy
2018/09/18 03:21:57 Starting to serve on /var/lib/kubelet/device-plugins/rdma-sriov-dp.sock
2018/09/18 03:21:57 Registered device plugin with Kubelet
exposing devices:  []

I also tried writing to /sys/class/net/ens5f0/device/sriov_numvfs directly with echo:

[root@localhost bin]# echo 0 > /sys/class/net/ens5f0/device/sriov_numvfs
[root@localhost bin]# echo 8 > /sys/class/net/ens5f0/device/sriov_numvfs 
-bash: echo: write error: Device or resource busy

Environment:

[root@localhost bin]# mst version
mst, mft 4.10.0-104, built on Jul 01 2018, 17:14:32. Git SHA Hash: 9999fe7

[root@localhost bin]# kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
[root@localhost bin]# kubelet --version
Kubernetes v1.10.4

Thanks for your help!

paravmellanox commented 5 years ago

@BigXin, can you please share the kernel log messages, for example from /var/log/messages?

dahsing commented 5 years ago

@paravmellanox

Sep 18 00:35:28 localhost kernel: mlx5_core 0000:82:00.0: mlx5_device_enable_sriov:87:(pid 26690): failed to enable SRIOV on device, already enabled with 8 vfs
Sep 18 00:35:28 localhost kernel: mlx5_core 0000:82:00.0: mlx5_sriov_enable:196:(pid 26690): mlx5_device_enable_sriov failed : -16
Sep 18 00:35:29 localhost kernel: IPVS: Creating netns size=2040 id=6317
Sep 18 00:35:29 localhost kernel: mlx5_core 0000:82:00.0: mlx5_device_enable_sriov:87:(pid 26810): failed to enable SRIOV on device, already enabled with 8 vfs
Sep 18 00:35:29 localhost kernel: mlx5_core 0000:82:00.0: mlx5_sriov_enable:196:(pid 26810): mlx5_device_enable_sriov failed : -16
Sep 18 00:35:30 localhost kernel: IPVS: Creating netns size=2040 id=6318
Sep 18 00:35:30 localhost kernel: mlx5_core 0000:82:00.0: mlx5_device_enable_sriov:87:(pid 26933): failed to enable SRIOV on device, already enabled with 8 vfs
Sep 18 00:35:30 localhost kernel: mlx5_core 0000:82:00.0: mlx5_sriov_enable:196:(pid 26933): mlx5_device_enable_sriov failed : -16
Sep 18 00:35:31 localhost kernel: IPVS: Creating netns size=2040 id=6319

It looks like SR-IOV is already enabled with 8 VFs.
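For readers following along: the -16 in these kernel messages is the negated errno EBUSY, the same "device or resource busy" that the plugin and the echo command reported. A minimal sketch for pulling just the relevant lines out of a log, assuming syslog-style lines like the ones quoted above and a log at /var/log/messages; the function name mlx5_sriov_errors is hypothetical and not part of the plugin or the driver:

```shell
# Hypothetical helper: print mlx5_core SR-IOV enable failures from a log file.
# Assumes syslog-style kernel lines like those quoted in this thread.
# The log path defaults to /var/log/messages and can be passed as $1.
mlx5_sriov_errors() {
    logfile="${1:-/var/log/messages}"
    grep -E 'mlx5_core .*mlx5_(device_enable_sriov|sriov_enable)' "$logfile"
}
```

Calling mlx5_sriov_errors with no argument reads /var/log/messages; on systems that log elsewhere, pass the path explicitly.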

paravmellanox commented 5 years ago

@BigXin If sriov_numvfs reads 0 but you still get this error message from the driver, it indicates a driver bug: the driver failed to enable SR-IOV the first time, most likely because SR-IOV is not enabled at the BIOS level. Is it possible to share a system log from the beginning (from system boot time)?
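The mismatch described here (sysfs reporting 0 VFs while the driver claims 8 are enabled) can be checked by reading the two standard PCI SR-IOV sysfs attributes next to each other. A rough sketch, assuming the PF netdev exposes sriov_numvfs and sriov_totalvfs under its device directory; the function name sriov_state and the SYSFS_NET override are hypothetical and exist only for illustration and testing:

```shell
# Hypothetical helper: report current and maximum VF counts for a PF netdev.
# SYSFS_NET is an illustrative override; it defaults to /sys/class/net.
sriov_state() {
    dev="$1"
    base="${SYSFS_NET:-/sys/class/net}/$dev/device"
    # Both attributes must be readable, otherwise the device has no SR-IOV support
    # visible to the kernel (or the netdev name is wrong).
    if [ ! -r "$base/sriov_numvfs" ] || [ ! -r "$base/sriov_totalvfs" ]; then
        echo "$dev: SR-IOV attributes not found"
        return 1
    fi
    cur=$(cat "$base/sriov_numvfs")
    max=$(cat "$base/sriov_totalvfs")
    echo "$dev: cur_vfs=$cur max_vfs=$max"
}
```

On the node in this issue, sriov_state ens5f0 would be expected to report cur_vfs=0 and max_vfs=8, matching the plugin log; if the kernel log still says "already enabled with 8 vfs" at that point, the two views disagree, which is the situation described above.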

dahsing commented 5 years ago

@paravmellanox I'll check the BIOS first; maybe SR-IOV was never enabled on this node.

paravmellanox commented 5 years ago

@BigXin OK. If the issue is not resolved, please also share the output of lspci -vvv.

dahsing commented 5 years ago

@paravmellanox Thanks a lot, it was resolved.

paravmellanox commented 5 years ago

@BigXin Awesome, can you please close the issue?