Xilinx / FPGA_as_a_Service

https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Xilinx_Kubernetes_Device_Plugin
Apache License 2.0

Following documentation is not working properly. #39

Open wkozlowski750 opened 1 year ago

wkozlowski750 commented 1 year ago

I have a Kubernetes cluster running with two nodes, using the Calico CNI. One node has two U55Cs and the other has one U55C installed; all cards are flashed, and XRT is installed on all nodes. I am following the instructions in this document: https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Installing-K8s-Device-Plugin-on-Kubernetes

When I start the daemonset via the instruction

kubectl apply -f ./k8s-device-plugin.yml

the pods get stuck in a CrashLoopBackOff. The pod logs show the following:

time="2023-06-27T22:14:23Z" level=info msg="Plugin Version: 1.2.0"
time="2023-06-27T22:14:23Z" level=info msg="Set U30NameConvention: CommonName"
time="2023-06-27T22:14:23Z" level=info msg="Set U30AllocUnit: Card"
time="2023-06-27T22:14:23Z" level=info msg="Set DeviceNameCustomize: False"
time="2023-06-27T22:14:23Z" level=info msg="Virtual Device Mode: OFF"
time="2023-06-27T22:14:23Z" level=warning msg="Invalid input for VirtualNum, will set VirtualNum as 1"
time="2023-06-27T22:14:23Z" level=info msg="VirtualNum: 1"
time="2023-06-27T22:14:23Z" level=info msg="Starting FS watcher."
time="2023-06-27T22:14:23Z" level=info msg="Starting OS watcher."
panic: runtime error: index out of range [1] with length 1

goroutine 5 [running]:
main.GetDevices()
    /root/yuzhang/upgrade/k8s-fpga-device-plugin-1/fpga.go:219 +0x18c5
main.NewFPGADevicePlugin.func1()
    /root/yuzhang/upgrade/k8s-fpga-device-plugin-1/server.go:149 +0x99
created by main.NewFPGADevicePlugin
    /root/yuzhang/upgrade/k8s-fpga-device-plugin-1/server.go:147 +0x438
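When plugin logs like these are pasted as one long run of text, the warnings and errors are easy to miss. A minimal sketch, assuming the `time=... level=... msg="..."` key/value layout shown above: it splits the text into one record per entry so non-info entries can be filtered. The names `PAIR` and `parse_log` are illustrative and not part of the device plugin.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a bare token without whitespace.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_log(text):
    """Split logrus-style text into one dict per log entry.

    A new entry starts at each `time=` key, so this also works when several
    entries have been flattened onto a single line.
    """
    entries = []
    for key, raw in PAIR.findall(text):
        if raw.startswith('"'):
            # Strip the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        if key == "time" or not entries:
            entries.append({})
        entries[-1][key] = raw
    return entries


sample = (
    'time="2023-06-27T22:14:23Z" level=info msg="Plugin Version: 1.2.0" '
    'time="2023-06-27T22:14:23Z" level=warning '
    'msg="Invalid input for VirtualNum, will set VirtualNum as 1"'
)

# Keep only the entries that are not plain info messages.
problems = [e for e in parse_log(sample) if e.get("level") != "info"]
for entry in problems:
    print(entry["level"], entry["msg"])
```

Lines that do not follow the key/value layout, such as the Go panic and stack trace, are ignored by this parser and still need to be read directly.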

Could you provide any insight into why this is not working?

yuzhang66 commented 1 year ago

Hi @wkozlowski750, this issue might be caused by an unexpected error in the U55C shell support. Could you try the previous version of the device plugin image and see what happens? You can replace the image in k8s-device-plugin.yml with public.ecr.aws/xilinx_dcg/k8s-device-plugin:1.1.0

wkozlowski750 commented 1 year ago

@yuzhang66, Thanks for the response. I have tried this and I now get the following error shown in the logs:

time="2023-06-29T16:22:44Z" level=info msg="Plugin Version: 1.1.0"
time="2023-06-29T16:22:44Z" level=info msg="Set U30NameConvention: CommonName"
time="2023-06-29T16:22:44Z" level=info msg="Set U30AllocUnit: Card"
time="2023-06-29T16:22:44Z" level=info msg="Starting FS watcher."
time="2023-06-29T16:22:44Z" level=info msg="Starting OS watcher."
time="2023-06-29T16:23:05Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/-0-fpga.sock"
time="2023-06-29T16:23:05Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/xilinx_u55c_gen3x16_xdma_base_3-0-fpga.sock"
2023/06/29 16:23:05 transport: http2Server.HandleStreams failed to read frame: read unix /var/lib/kubelet/device-plugins/-0-fpga.sock->@: read: connection reset by peer
2023/06/29 16:23:05 transport: http2Server.HandleStreams failed to read frame: read unix /var/lib/kubelet/device-plugins/xilinx_u55c_gen3x16_xdma_base_3-0-fpga.sock->@: read: connection reset by peer
time="2023-06-29T16:23:05Z" level=error msg="Could not register device plugin: rpc error: code = Unknown desc = the ResourceName \"amd.com/-0\" is invalid"
time="2023-06-29T16:23:05Z" level=info msg="Could not contact Kubelet, Exit. Did you enable the device plugin feature gate?"

I tried enabling the device plugin feature gate, but after looking into it, in Kubernetes v1.27.3 (which I am using) DevicePlugins is a graduated feature, which means it is always on and cannot be turned off. Any advice?

yuzhang66 commented 1 year ago

Hi @wkozlowski750 ,

Yes, the DevicePlugins feature gate is always on.

From your log:

2023/06/29 16:23:05 transport: http2Server.HandleStreams failed to read frame: read unix /var/lib/kubelet/device-plugins/-0-fpga.sock->@: read: connection reset by peer

One of your U55C devices does not appear to be flashed properly, so the device plugin cannot read its shell name. Could you confirm this by running "xbutil examine" on the host and checking the result? From the logs, only one U55C appears to be marked as ready.
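The rejected name in the log can be checked against the qualified-name rule that Kubernetes applies to resource names: an optional DNS-style prefix, a slash, and a name segment that must start and end with an alphanumeric character. The sketch below is a simplified Python approximation of that rule, not the kubelet's actual code, and the real check may apply further conditions. `resource_name` is a hypothetical helper that mimics the `<shell name>-0` pattern visible in the log; it shows how an empty shell name yields `amd.com/-0`.

```python
import re

# Name segment: alphanumeric at both ends; '-', '_' and '.' allowed inside.
NAME_RE = re.compile(r"([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]")
# Prefix: a lowercase DNS subdomain such as "amd.com".
PREFIX_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_qualified_name(value):
    """Approximate the Kubernetes qualified-name rule for `prefix/name`."""
    parts = value.split("/")
    if len(parts) == 1:
        name = parts[0]
    elif len(parts) == 2:
        prefix, name = parts
        if len(prefix) > 253 or not PREFIX_RE.fullmatch(prefix):
            return False
    else:
        return False
    return 0 < len(name) <= 63 and NAME_RE.fullmatch(name) is not None


def resource_name(shell_name, timestamp="0", vendor="amd.com"):
    # Hypothetical reconstruction of the naming pattern seen in the log.
    return f"{vendor}/{shell_name}-{timestamp}"


for shell in ("xilinx_u55c_gen3x16_xdma_base_3", ""):
    name = resource_name(shell)
    print(name, "valid" if is_qualified_name(name) else "invalid")
```

Under this rule the name built from an empty shell name is invalid because its name segment begins with a hyphen, which is consistent with the registration error in the log.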

Thank you, --Yu

wkozlowski750 commented 1 year ago

@yuzhang66

I did flash both boards with the base image as well as the firmware update. Running xbutil examine from the host I get the following:

System Configuration
  OS Name      : Linux
  Release      : 5.4.0-152-generic
  Version      : #169-Ubuntu SMP Tue Jun 6 22:23:09 UTC 2023
  Machine      : x86_64
  CPU Cores    : 128
  Memory       : 515804 MB
  Distribution : Ubuntu 20.04.6 LTS
  GLIBC        : 2.31
  Model        : AS -4124GS-TNR

XRT
  Version   : 2.15.225
  Branch    : 2023.1
  Hash      : adf27adb3cfadc6e4c41d6db814159f1329b24f3
  Hash Date : 2023-05-03 10:13:38
  XOCL      : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3
  XCLMGMT   : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3

Devices present
BDF            : Shell                            Platform UUID                         Device ID       Device Ready*
[0000:81:00.1] : xilinx_u55c_gen3x16_xdma_base_3  97088961-FEAE-DA91-52A2-1D9DFD63CCEF  user(inst=134)  Yes
[0000:c1:00.1] : xilinx_u55c_gen3x16_xdma_base_3  97088961-FEAE-DA91-52A2-1D9DFD63CCEF  user(inst=133)  Yes

xbmgmt examine gives the following:

System Configuration
  OS Name      : Linux
  Release      : 5.4.0-152-generic
  Version      : #169-Ubuntu SMP Tue Jun 6 22:23:09 UTC 2023
  Machine      : x86_64
  CPU Cores    : 128
  Memory       : 515804 MB
  Distribution : Ubuntu 20.04.6 LTS
  GLIBC        : 2.31
  Model        : AS -4124GS-TNR

XRT
  Version   : 2.15.225
  Branch    : 2023.1
  Hash      : adf27adb3cfadc6e4c41d6db814159f1329b24f3
  Hash Date : 2023-05-03 10:13:38
  XOCL      : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3
  XCLMGMT   : 2.15.225, adf27adb3cfadc6e4c41d6db814159f1329b24f3

Devices present
BDF            : Shell                            Platform UUID                         Device ID         Device Ready*
[0000:81:00.0] : xilinx_u55c_gen3x16_xdma_base_3  97088961-FEAE-DA91-52A2-1D9DFD63CCEF  mgmt(inst=33024)  Yes
[0000:c1:00.0] : xilinx_u55c_gen3x16_xdma_base_3  97088961-FEAE-DA91-52A2-1D9DFD63CCEF  mgmt(inst=49408)  Yes

Both devices appear to be ready, and both pass all the validate tests that come with XRT when run on the host machine.
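The "Devices present" rows can also be checked programmatically. A minimal sketch, assuming each row has the shape shown above (`[BDF] : shell UUID device-id ready`) and that every column is non-empty; a device with a blank shell column would not match this pattern and would need separate handling. The names `ROW` and `parse_devices` are illustrative only.

```python
import re

# One device row: "[BDF] : <shell> <platform UUID> <device id> <ready>".
ROW = re.compile(
    r"\[(?P<bdf>[0-9a-fA-F:.]+)\]\s*:\s*"
    r"(?P<shell>\S+)\s+(?P<uuid>\S+)\s+(?P<device_id>\S+)\s+(?P<ready>\S+)"
)


def parse_devices(examine_output):
    """Return one dict per device row found in `xbutil examine` output."""
    return [m.groupdict() for m in ROW.finditer(examine_output)]


sample = """
[0000:81:00.1] : xilinx_u55c_gen3x16_xdma_base_3 97088961-FEAE-DA91-52A2-1D9DFD63CCEF user(inst=134) Yes
[0000:c1:00.1] : xilinx_u55c_gen3x16_xdma_base_3 97088961-FEAE-DA91-52A2-1D9DFD63CCEF user(inst=133) Yes
"""

devices = parse_devices(sample)
not_ready = [d["bdf"] for d in devices if d["ready"] != "Yes"]
print(len(devices), "devices,", len(not_ready), "not ready")
```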

I appreciate the help, -Will

yuzhang66 commented 1 year ago

Thank you @wkozlowski750 ,

XRT generates the shell name files (VBNV), but it looks like the device plugin is not reading the correct shell name from one of the U55C devices.

Could you check the following files and print the contents of each VBNV file?

$ cat /sys/bus/pci/devices/[u55c device BDF]/rom.u.*/VBNV

Here is an example (device BDF 0000:82:00.1):

$ cat /sys/bus/pci/devices/0000:82:00.1/rom.u.1/VBNV
xilinx_u50_gen3x16_xdma_base_5
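The same check can be run across every device at once. A minimal sketch, assuming the `rom.u.*/VBNV` sysfs layout from the command above; the function name `read_shell_names` and the `sysfs_root` parameter are illustrative, the latter added so the function can be pointed at a test directory. A device whose VBNV file is empty shows up with an empty shell name.

```python
import glob
import os


def read_shell_names(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI BDF to the contents of its rom.u.*/VBNV file."""
    names = {}
    pattern = os.path.join(sysfs_root, "*", "rom.u.*", "VBNV")
    for path in sorted(glob.glob(pattern)):
        # .../<BDF>/rom.u.N/VBNV -> the BDF is two directory levels up.
        bdf = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as f:
            names[bdf] = f.read().strip()
    return names


if __name__ == "__main__":
    for bdf, shell in read_shell_names().items():
        print(bdf, shell if shell else "<empty shell name>")
```

Devices without a `rom.u.*` directory are simply absent from the result, so comparing the keys against the BDFs reported by xbutil examine can also reveal a card that exposes no VBNV file at all.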

Thank you for helping to figure this out. I'm also trying to reproduce this error on my side.

--Yu

wkozlowski750 commented 1 year ago

@yuzhang66

Running the command for both devices I get the following:

wk10@mlcluster3:~/kube_tests$ cat /sys/bus/pci/devices/0000:81:00.1/rom.u.*/VBNV
xilinx_u55c_gen3x16_xdma_base_3

wk10@mlcluster3:~/kube_tests$ cat /sys/bus/pci/devices/0000:c1:00.1/rom.u.*/VBNV
xilinx_u55c_gen3x16_xdma_base_3

Thanks, -Will

yuzhang66 commented 1 year ago

Thank you @wkozlowski750 ,

I'm going to try to reproduce this issue. Does your worker node with one U55C have the same issue?

--Yu

wkozlowski750 commented 1 year ago

@yuzhang66

Yes, the worker node has the same issue.

-Will

yuzhang66 commented 1 year ago

Thank you, I will check it and give you feedback asap.

iavssw commented 11 months ago

Has this plugin ever been validated on the U55C, and is there a list of supported platforms that have been validated?

yuzhang66 commented 11 months ago

Hi @wkozlowski750 @iavssw,

We have verified that the device plugin and u55c cards work on kubernetes version v1.26 and v1.27 with shell package version xilinx_u55c_gen3x16_xdma_base_3, XRT version 2.15.225.

@wkozlowski750, currently I can't reproduce this issue in my environment. Is there any update from your side?

Thanks.

scottcs2 commented 11 months ago

@yuzhang66 we still have this issue (I am working with @wkozlowski750). Running the device plugin on Kubernetes v1.27.4 with shell xilinx_u55c_gen3x16_xdma_base_3, XRT 2.15.225 does not work. I don't know how to proceed debugging this. Is there more information we can provide you? The plugin repeatedly crashes every few seconds, and no FPGA resources show up on kubectl describe node <node-name>.

To clarify, we are running this on a single node. The head node is the one equipped with two U55Cs. The K8s cluster setup only has one node on it.

This is the output of kubectl logs -n kube-system -p device-plugin-daemonset-gd9l9 https://pastebin.com/VAY42fg4

iavssw commented 11 months ago

# Copyright 2018-2022, Xilinx, Inc.
# Copyright 2023, Advanced Micro Device, Inc.
# Author: Brian Xu(brianx@xilinx.com)
# For technical support, please contact k8s_dev@amd.com
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
# if run with k8s v1.16-, replace the above line with
# apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: device-plugin-daemonset
  namespace: kube-system
spec:
  # if run with k8s v1.16-, the following 3 lines are not required
  selector:
    matchLabels:
      name: device-plugin
  template:
    metadata:
      labels:
        name: device-plugin
    spec:
      tolerations:
      priorityClassName: "system-node-critical"
      containers:



apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:


image


This configuration seems to work when we use an older version.

When we run describe node, we get multiple entries, and there seems to be a difference between the amd and xilinx ones. Could this be an issue with the later versions?

Any clarification from your side would be appreciated.

yuzhang66 commented 11 months ago

Hi @iavssw, the device registration name format changes in the new version, from xilinx.com/fpga-[shell version]-[timestamp] to amd.com/[shell version]-[timestamp]. After you migrate to the new version of the device plugin, devices are registered in the new format, and the old device names still appear because Kubernetes caches them. In later versions, all Alveo devices will be registered in the new format (amd.com/[shell version]-[timestamp]).
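To see which entries on a node use the old and the new naming, the resource map from the node's `status.allocatable` field (for example from `kubectl get node <node-name> -o json`) can be grouped by vendor prefix. A minimal sketch; the function name `group_fpga_resources` is illustrative, and the counts in `allocatable` below are made-up sample values, not output from this cluster.

```python
def group_fpga_resources(allocatable):
    """Split a node's allocatable resources into old- and new-style FPGA names."""
    groups = {"xilinx.com": {}, "amd.com": {}}
    for name, count in allocatable.items():
        prefix, _, rest = name.partition("/")
        if prefix in groups and rest:
            groups[prefix][name] = count
    return groups


# Hypothetical sample of a node's status.allocatable map.
allocatable = {
    "cpu": "128",
    "xilinx.com/fpga-xilinx_u55c_gen3x16_xdma_base_3-0": "0",
    "amd.com/xilinx_u55c_gen3x16_xdma_base_3-0": "2",
}

for vendor, resources in group_fpga_resources(allocatable).items():
    for name, count in resources.items():
        print(f"{vendor:12} {name} = {count}")
```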

This is the kubectl describe node result on my side: image

From your result, I noticed the device plugin is trying to register a device with no shell version loaded (xilinx.com/fpga--0), which means one card has not been flashed properly. Were there any unflashed Alveo cards installed on this node when you noticed this issue?

iavssw commented 11 months ago

Thank you for your reply.

Yes, we have another device that we did not program, but it is not currently an issue.

However, can I ask which version of the plugin you are using? We were unable to detect the devices using versions 1.1 and 1.2, so we currently have a setup that detects the device using version 1.0.101.

However, when we run a test example inside a pod (one that ran successfully on the host), we get the following error from XRT.

image

When I run dmesg I get the following errors:

image

Is this a problem on our side, or is it the result of using plugin version 1.0.101? If it is the latter, any ideas why the same system works with 1.0.101 but not with 1.1.0 or 1.2.0? Any help will be appreciated.

yuzhang66 commented 11 months ago

Hi @iavssw, can you say more about which device you cannot detect with device plugin version 1.1/1.2? Is it the U55C device with shell version xilinx_u55c_gen3x16_xdma_base_3? I tested device plugin versions 1.101/1.1/1.2 in my Kubernetes 1.27 environment, and all "ready" devices were detected and registered. The unflashed device may be the reason device plugin 1.1/1.2 is not working properly.

As for the pod application error, it looks like an XRT issue. Does the pod's Docker image use the same XRT version as the host? XRT requires the container and the host to use the same XRT version.
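One way to compare the two XRT versions is to run `xbutil examine` on the host and inside the container and compare the "XRT ... Version" field from each output. A minimal sketch, assuming the field appears in the same form as in the host output earlier in this thread; the helper names are illustrative, and the container version string below is a made-up value used only to show a mismatch.

```python
import re

# The XRT version is the "Version" field that directly follows the "XRT" header.
XRT_VERSION = re.compile(r"XRT\s+Version\s*:\s*(\S+)")


def xrt_version(examine_output):
    """Extract the XRT version from `xbutil examine` output, or None."""
    match = XRT_VERSION.search(examine_output)
    return match.group(1) if match else None


def versions_match(host_output, container_output):
    host = xrt_version(host_output)
    container = xrt_version(container_output)
    return host is not None and host == container


host_text = "XRT\n  Version : 2.15.225\n  Branch : 2023.1\n"
# Hypothetical container output with a different XRT build.
container_text = "XRT\n  Version : 2.14.354\n  Branch : 2022.2\n"

print("host:", xrt_version(host_text))
print("container:", xrt_version(container_text))
print("match:", versions_match(host_text, container_text))
```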