Xilinx / FPGA_as_a_Service

https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Xilinx_Kubernetes_Device_Plugin
Apache License 2.0

Xbutil reset prevents FPGA resources from becoming allocatable #17

Closed kenhill closed 2 years ago

kenhill commented 3 years ago

Running 'xbutil reset' prevents the FPGAs on a node from becoming allocatable to scheduled pods until the DS pod on that node is deleted or restarted.
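The workaround described above (deleting the device plugin's DaemonSet pod on the affected node so that it is recreated) could be scripted roughly as follows. This is only a sketch: the namespace `kube-system`, the label `name=fpga-device-plugin`, and the function name are assumptions, not values taken from this repository, so check the deployed DaemonSet manifest for the real ones.

```shell
#!/bin/sh
# Sketch: delete the device-plugin pod on one node so the DaemonSet recreates it.
# ASSUMPTIONS (hypothetical): the namespace and label selector below; adjust
# them to match the DaemonSet actually deployed in your cluster.
KUBECTL="${KUBECTL:-kubectl}"
PLUGIN_NS="${PLUGIN_NS:-kube-system}"
PLUGIN_LABEL="${PLUGIN_LABEL:-name=fpga-device-plugin}"

restart_plugin_pod() {
    node="${1:-}"
    if [ -z "$node" ]; then
        echo "usage: restart_plugin_pod <node-name>" >&2
        return 1
    fi
    # --field-selector restricts the delete to pods scheduled on that node.
    "$KUBECTL" delete pod -n "$PLUGIN_NS" -l "$PLUGIN_LABEL" \
        --field-selector "spec.nodeName=$node"
}
```

For example, `restart_plugin_pod my-fpga-node`, followed by `kubectl describe node my-fpga-node` to confirm that the FPGA resource appears under Allocatable again.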

xuhz commented 3 years ago

Ken, can you try this plugin and let me know if it works? huaxu/xilinx_k8s_fpga_plugin:06302020

I believe this is a pure plugin issue. 'xbutil reset' behavior works as expected -- after the reset, 'xbutil validate' still works, right?

-Brian

kenhill commented 3 years ago

Brian, this version fixed the allocation issue for pending pods while introducing a new problem. The FPGA becomes unusable in the current pod after 'xbutil reset' finishes. Here is the output of 'xbutil validate' after a reset inside a pod:

$ xbutil validate
INFO: Found 1 cards
XRT build version: 2.6.655
Build hash: 2d6bfe4ce91051d4e5b499d38fc493586dd4859a
Build date: 2020-05-22 12:03:17
Git branch: 2020.1
PID: 409
UID: 505
[Tue Aug 18 22:37:37 2020]
HOST: jarvice-job-5452-bh24g
EXE: /opt/xilinx/xrt/bin/unwrapped/xbutil
[XRT] WARNING: XRT

INFO: Validating card[0]: xilinx_u50_gen3x16_xdma_201920_3
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
ERROR: Failed to download xclbin: verify.xclbin
ERROR: == verify kernel test FAILED
INFO: Card[0] failed to validate.

ERROR: Some cards failed to validate.

xuhz commented 3 years ago

Ken, I just tried at my end. Seems everything works fine.

root@mypod:/# xbutil reset
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
root@mypod:/# xbutil validate -q
INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u200_xdma_201830_2
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: Card[0] validated successfully.

INFO: All cards validated successfully.
root@mypod:/#

Several things to check:

  1. Can you run validate on the host?
  2. Since you can still access the FPGA within the pod, I believe the host has not revoked the 'rwm' permissions of the device. That is to say, the plugin is working fine.
  3. What does dmesg show for your xclbin download failure?

kenhill commented 3 years ago

Hi Brian,

This does not appear to be an XRT issue. After 'xbutil reset', the 'renderD128' device node comes back inside the container with different permissions, and accessing it then requires root.

khill@jarvice-job-5476-qgjk8:~$ ls -l /dev/dri/renderD128
crw-rw-rw- 1 root video 226, 128 Aug 19 22:03 /dev/dri/renderD128
khill@jarvice-job-5476-qgjk8:~$ xbutil validate -q
INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u50_gen3x16_xdma_201920_3
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: Card[0] validated successfully.

INFO: All cards validated successfully.
khill@jarvice-job-5476-qgjk8:~$ xbutil reset
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
khill@jarvice-job-5476-qgjk8:~$ ls -l /dev/dri/renderD128
c--------- 0 root root 226, 128 Aug 19 22:03 /dev/dri/renderD128
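The before/after `ls -l` comparison above can be automated with a small check. A minimal sketch, assuming GNU `stat` (coreutils) is available in the container; the function names are illustrative and are not part of XRT or the device plugin.

```shell
#!/bin/sh
# Sketch: report whether a device node is readable and writable by "other"
# users, which is what the mode crw-rw-rw- grants and c--------- does not.
# Assumes GNU coreutils `stat -c`.

node_mode() {
    # Print the octal permission bits, e.g. 666 for crw-rw-rw-, 0 for c---------.
    stat -c '%a' "$1"
}

is_world_rw() {
    mode=$(node_mode "$1") || return 2
    # The leading 0 makes the shell parse the value as octal; 6 = r+w for "other".
    [ $(( 0$mode & 6 )) -eq 6 ]
}

check_fpga_node() {
    if is_world_rw "$1"; then
        echo "$1: world read/write (mode $(node_mode "$1"))"
    else
        echo "$1: NOT world read/write (mode $(node_mode "$1"))"
        return 1
    fi
}
```

Running `check_fpga_node /dev/dri/renderD128` before and after 'xbutil reset' should make the permission change visible without reading `ls -l` output by eye.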

xuhz commented 3 years ago

Ken,

This looks like a k8s issue...

You can try this: before running reset, kill the plugin DS. You can still use the card within the pod. After the reset, the ACL of the file still changes to what you saw, which means the ACL is modified by k8s itself.

As a comparison, when the FPGA is reset within a Docker container, the ACL of the file doesn't change. I believe the LXC containers you have in Jarvice 2 don't have the issue either.

It does not look like a big issue. Until it is fixed in k8s, if you prefer that Jarvice users not run as root within the container, then after a reset tell the user to su to root and run 'chmod a+rw /dev/dri/xxxx' within the container. Non-root users within the container can then use the FPGA again.
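The workaround above could be wrapped in a small helper that root runs inside the container after a reset. This is a sketch under assumptions: the `renderD*` glob and the default directory `/dev/dri` are guesses based on the `renderD128` node shown earlier in this thread, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: restore read/write access on DRM render nodes after `xbutil reset`.
# Must be run as root inside the container. ASSUMPTION: the FPGA nodes match
# <dir>/renderD*; pass a different directory as the first argument if not.

restore_fpga_nodes() {
    dir="${1:-/dev/dri}"
    found=0
    for node in "$dir"/renderD*; do
        # If the glob matched nothing, the literal pattern remains; skip it.
        [ -e "$node" ] || continue
        chmod a+rw "$node" || return 1
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no render nodes found in $dir" >&2
        return 1
    fi
}
```

Calling `restore_fpga_nodes` with no argument targets `/dev/dri`. Note that the glob touches every render node in the directory, so on a host that also exposes other render devices a narrower path may be preferable.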

-Brian

kenhill commented 3 years ago

Thanks, Brian. I agree the dev node permissions are a separate issue outside of the device plugin/XRT.

Can you commit your device plugin changes here and push a new container to DockerHub: xilinxatg/xilinx_k8s_fpga_plugin?