Xilinx / FPGA_as_a_Service

https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Xilinx_Kubernetes_Device_Plugin
Apache License 2.0

FPGA #43

Open iavssw opened 7 months ago

iavssw commented 7 months ago

We cannot reliably submit more than one job from inside a pod. If more than one job is submitted in succession, the second run fails with an input/output error (see the log below). The problem can be mitigated by running xbutil reset from the host before the pod is spun up, but this is not desirable.

Any feedback would be appreciated.

user@mlcluster-interactive-example-jfdz2:~/FPGA_test$ ./host vadd_hw.xclbin 512 0 1 64

 Total Data of 512.000 Mbytes to be written to global memory from host

 Kernel is invoked 1 time and repeats itself 1 times

Found Platform
Platform Name: Xilinx
DEVICE xilinx_u55c_gen3x16_xdma_base_3
INFO: Reading vadd_hw.xclbin
Loading: 'vadd_hw.xclbin'
- host loop iteration #0 of 1 total iterations
kernel_time_in_sec = 0.0421578
Duration using events profiling: 42050286 ns
 match_count = 134217728 mismatch_count = 0 total_data_size = 134217728
Throughput Achieved = 12.7674 GB/s
TEST PASSED
user@mlcluster-interactive-example-jfdz2:~/FPGA_test$ ./host vadd_hw.xclbin 512 0 1 64

 Total Data of 512.000 Mbytes to be written to global memory from host

 Kernel is invoked 1 time and repeats itself 1 times

Found Platform
Platform Name: Xilinx
DEVICE xilinx_u55c_gen3x16_xdma_base_3
INFO: Reading vadd_hw.xclbin
Loading: 'vadd_hw.xclbin'
- host loop iteration #0 of 1 total iterations
XRT build version: 2.14.384
Build hash: 090bb050d570d2b668477c3bd0f979dc3a34b9db
Build date: 2022-12-09 00:55:08
Git branch: 2022.2
PID: 99
UID: 1006
[Mon Apr  8 15:10:45 2024 GMT]
HOST: mlcluster-interactive-example-jfdz2
EXE: /home/gregj/FPGA_test/host
[XRT] ERROR: unable to sync BO: Input/output error
terminate called after throwing an instance of 'xrt_xocl::error'
  what():  event 0 never submitted
Aborted (core dumped)
yuzhang66 commented 7 months ago

Hi @iavssw, this issue may be related to the XRT container solution. Could you try running this test in a plain container environment without k8s and see what happens? If it can be reproduced there, I suggest reaching out to the XRT team for further help.
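
To compare the two environments, the back-to-back runs could be scripted so the same check is repeated inside the k8s pod and inside a plain container. The following is only a minimal sketch: the helper name `run_in_succession` is hypothetical, and it assumes the host binary exits with a non-zero status when the sync error occurs (the abort in the log above suggests it does).

```python
import subprocess


def run_in_succession(cmd, runs=5):
    """Run `cmd` back to back up to `runs` times, stopping at the first failure.

    Returns a list of (run_index, return_code, last_stderr_line) tuples.
    A process killed by a signal (e.g. abort) shows up as a negative return code.
    """
    results = []
    for index in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Keep only the last stderr line, which is usually the most telling one.
        stderr_lines = proc.stderr.strip().splitlines()
        last_line = stderr_lines[-1] if stderr_lines else ""
        results.append((index, proc.returncode, last_line))
        if proc.returncode != 0:
            break  # the first failing run is the one of interest
    return results
```

For example, `run_in_succession(["./host", "vadd_hw.xclbin", "512", "0", "1", "64"])` reuses the command from the report. If the failure index matches in both environments, that would point toward XRT or the container setup rather than the k8s device plugin.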