kenhill closed 2 years ago
Ken, can you try this plugin and let me know if it works? huaxu/xilinx_k8s_fpga_plugin:06302020
I believe this is a pure plugin issue. 'xbutil reset' behavior works as expected -- after the reset, 'xbutil validate' still works, right?
-Brian
Brian, this version fixed the allocation issue for pending pods but introduced a new problem: the FPGA becomes unusable in the current pod after 'xbutil reset' finishes. Here is the output of 'xbutil validate' after a reset inside a pod:
$ xbutil validate
INFO: Found 1 cards
XRT build version: 2.6.655
Build hash: 2d6bfe4ce91051d4e5b499d38fc493586dd4859a
Build date: 2020-05-22 12:03:17
Git branch: 2020.1
PID: 409
UID: 505
[Tue Aug 18 22:37:37 2020]
HOST: jarvice-job-5452-bh24g
EXE: /opt/xilinx/xrt/bin/unwrapped/xbutil
[XRT] WARNING: XRT
INFO: Validating card[0]: xilinx_u50_gen3x16_xdma_201920_3
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
ERROR: Failed to download xclbin: verify.xclbin
ERROR: == verify kernel test FAILED
INFO: Card[0] failed to validate.
ERROR: Some cards failed to validate.
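For comparing runs like the one above, here is a minimal Python sketch that collects the per-check result lines into a dictionary. It assumes one message per line and the "== <check name> PASSED/FAILED/SKIPPED" wording seen in this output; the function name `summarize_checks` and the `SAMPLE` text are illustrative only and not part of XRT.

```python
import re

# Matches result lines such as "INFO: == PCIE link check PASSED".
# "Starting ..." lines end with a colon, so they do not match.
RESULT_RE = re.compile(r"(?:INFO|ERROR): == (.+) (PASSED|FAILED|SKIPPED)$")

def summarize_checks(output):
    """Map each check name to its status, assuming one message per line."""
    results = {}
    for line in output.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2)
    return results

# Excerpt of the output posted above, one message per line.
SAMPLE = """\
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting verify kernel test:
ERROR: Failed to download xclbin: verify.xclbin
ERROR: == verify kernel test FAILED
"""

if __name__ == "__main__":
    for name, status in summarize_checks(SAMPLE).items():
        print(f"{status:8} {name}")
```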
Ken, I just tried at my end. Seems everything works fine.
root@mypod:/# xbutil reset
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
root@mypod:/# xbutil validate -q
INFO: Found 1 cards
INFO: Validating card[0]: xilinx_u200_xdma_201830_2
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: Card[0] validated successfully.
INFO: All cards validated successfully.
root@mypod:/#
Several things to check,
Hi Brian,
This does not appear to be an XRT issue. After 'xbutil reset', the 'renderD128' device node comes back inside the container with all permission bits cleared, so only root can access it.
khill@jarvice-job-5476-qgjk8:~$ ls -l /dev/dri/renderD128
crw-rw-rw- 1 root video 226, 128 Aug 19 22:03 /dev/dri/renderD128
khill@jarvice-job-5476-qgjk8:~$ xbutil validate -q
INFO: Found 1 cards
INFO: Validating card[0]: xilinx_u50_gen3x16_xdma_201920_3
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: Card[0] validated successfully.
INFO: All cards validated successfully.
khill@jarvice-job-5476-qgjk8:~$ xbutil reset
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
khill@jarvice-job-5476-qgjk8:~$ ls -l /dev/dri/renderD128
c--------- 0 root root 226, 128 Aug 19 22:03 /dev/dri/renderD128
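A rough way to detect this state from a script is to inspect the mode bits of the device node before and after the reset. The sketch below is only an illustration: the helper name `world_read_write` is made up, and it checks the "other" read/write bits only, which is a simplification (a user in the owning group could still have access through the group bits).

```python
import os
import stat

def world_read_write(path):
    """Return True if the file mode grants both read and write to 'other'."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IWOTH)

def mode_string(path):
    """Return the ls-style mode string, e.g. 'crw-rw-rw-' for a char device."""
    return stat.filemode(os.stat(path).st_mode)

if __name__ == "__main__":
    node = "/dev/dri/renderD128"  # device node from the listing above
    if os.path.exists(node):
        print(mode_string(node), "world rw:", world_read_write(node))
```

With the listing above, the node would report world read/write before the reset and not after it.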
Ken,
This looks like a k8s issue...
You can try this: before running the reset, kill the plugin DS (you can still use the card within the pod). After the reset, the ACL of the file still changes to what you saw. That means the ACL is modified by k8s itself, not by the plugin.
As a comparison, when the FPGA is reset within a docker container, the ACL of the file doesn't change. I believe the lxc containers you have in Jarvice 2 don't have the issue either.
It doesn't look like a big issue. Until it is fixed in k8s, if you prefer that Jarvice users not run as root within the container, then after a reset tell the user to su to root and run 'chmod a+rw /dev/dri/xxxx' within the container. After that, non-root users within the container can use the FPGA again.
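If the workaround needs to be scripted rather than typed by hand, the same effect as 'chmod a+rw' can be sketched in Python as below. This is an illustration only: the helper name `restore_rw_for_all` is hypothetical, it has to run as root inside the container for a root-owned device node, and the device path is left to the caller.

```python
import os
import stat

# Read and write bits for user, group, and other (the "a+rw" set).
RW_ALL = (stat.S_IRUSR | stat.S_IWUSR |
          stat.S_IRGRP | stat.S_IWGRP |
          stat.S_IROTH | stat.S_IWOTH)

def restore_rw_for_all(path):
    """Add rw for user, group, and other, keeping any bits already set.

    Returns the resulting permission bits.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | RW_ALL)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Called on a node whose mode was cleared to 000, this would leave it at 666, matching the 'crw-rw-rw-' listing seen before the reset.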
-Brian
Thanks, Brian. I agree the dev node permissions are a separate issue outside of the device plugin/XRT.
Can you commit your device plugin changes here and push a new container to DockerHub: xilinxatg/xilinx_k8s_fpga_plugin?
Running 'xbutil reset' will prevent the FPGAs on a node from becoming allocatable to scheduled pods until the DS pod on the node is deleted/restarted.