Xilinx / XRT

Run Time for AIE and FPGA based platforms
https://xilinx.github.io/XRT

Host Applications Control FPGA Without Using OpenCL/XRT API #7613

Open qianyich opened 1 year ago

qianyich commented 1 year ago

To the best of my knowledge, the XRT API is written in C++. Is there a way to use pure C to allocate memory, launch the FPGA kernel, and migrate data back and forth between host and device? In our scenario, we prefer C. I just found a section near the bottom of the "Execution Model Overview" page of the XRT Master documentation (xilinx.github.io) that explains the execution flow. I would like to know whether memory allocation, data transfer, and task launch can be done without the OpenCL API or the XRT API, using very low-level calls.

How can I use those ioctl calls directly to drive the FPGA? Are there any examples?

Thanks!

uday610 commented 1 year ago

We do have basic C API support equivalent to the XRT native C++ API. However, we no longer enhance the C API or take requests for new C APIs. We encourage using the C++ API; you can turn any C++ API into a C API by writing your own wrapper, which is what we recommend.

If you would still like to see the basic C APIs, they are in our old documentation:

https://xilinx.github.io/XRT/2020.2/html/xrt_native_apis.html

Thanks
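The wrapper approach recommended above usually follows the opaque-handle pattern. The following is a minimal sketch of that pattern, not XRT code: `DemoBuffer` and every `demo_buffer_*` function are hypothetical names invented for illustration, and the C++ class is a plain in-memory stand-in for whatever C++-only object you want to expose to C.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical C++ class standing in for a C++-only API object.
// It is not part of XRT; it only gives the wrapper something to wrap.
class DemoBuffer {
public:
    explicit DemoBuffer(std::size_t size) : data_(size, 0) {}

    void write(const void* src, std::size_t n, std::size_t offset) {
        check_range(n, offset);
        std::memcpy(data_.data() + offset, src, n);
    }

    void read(void* dst, std::size_t n, std::size_t offset) const {
        check_range(n, offset);
        std::memcpy(dst, data_.data() + offset, n);
    }

private:
    void check_range(std::size_t n, std::size_t offset) const {
        if (offset > data_.size() || n > data_.size() - offset)
            throw std::out_of_range("access outside buffer");
    }

    std::vector<unsigned char> data_;
};

// Opaque handle type: C callers only ever hold a pointer to it.
struct demo_buffer {
    explicit demo_buffer(std::size_t size) : impl(size) {}
    DemoBuffer impl;
};

// C-callable surface. No exception may cross this boundary, so every
// function catches everything and reports failure via its return value.
extern "C" {

demo_buffer* demo_buffer_create(std::size_t size) {
    try { return new demo_buffer(size); }
    catch (...) { return nullptr; }
}

int demo_buffer_write(demo_buffer* b, const void* src,
                      std::size_t n, std::size_t offset) {
    if (b == nullptr || src == nullptr) return -1;
    try { b->impl.write(src, n, offset); return 0; }
    catch (...) { return -1; }
}

int demo_buffer_read(const demo_buffer* b, void* dst,
                     std::size_t n, std::size_t offset) {
    if (b == nullptr || dst == nullptr) return -1;
    try { b->impl.read(dst, n, offset); return 0; }
    catch (...) { return -1; }
}

void demo_buffer_destroy(demo_buffer* b) { delete b; }

}  // extern "C"
```

In a real project the `extern "C"` declarations would also go into a header that a C compiler can include, with the C++ implementation compiled separately and linked in. The C side then only sees `demo_buffer*` and integer return codes.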

qianyich commented 1 year ago

Thank you. This is useful! I would also like to know whether memory allocation, data transfer, and kernel launch can be done from Linux kernel space. I saw a number of ioctl calls listed in the documentation, along with the following flow:

Execution Flow

A typical user execution flow would look like the following:

  1. Load xclbin using DOWNLOAD ioctl
  2. Discover compute unit register map from xclbin
  3. Allocate data buffers to feed to the compute units using CREATE_BO/MAP_BO ioctl calls
  4. Migrate input data buffers from host to device using SYNC_BO ioctl
  5. Allocate an execution command buffer using CREATE_BO/MAP_BO ioctl call and fill the command buffer using data in 2 above and following the format defined in ert.h
  6. Submit the execution command buffer using EXECBUF ioctl
  7. Wait for completion using POSIX poll
  8. Migrate output data buffers from device to host using SYNC_BO ioctl
  9. Release data buffers and command buffer

Is sample code for this flow provided somewhere?

Thanks & Regards

uday610 commented 1 year ago

No, those ioctl driver-level details are internal descriptions and are not meant for end users.

For end users we support the user-space APIs (the XRT native C++ API and, to some extent, the XRT native C API).

That said, XRT is open source and you can look at any of the ioctl code, but we do not expect end users to call it directly, and using it is at your own risk.

qianyich commented 1 year ago

Ok, that makes sense to me.

There are just a couple of things I would like to confirm:

  1. For buffer data transfers, the XDMA driver performs the DMA. For writing the registers of an FPGA kernel, MMIO is used. Is this correct?
  2. Data migration starts and ends in kernel space in either direction (device to host or host to device). So for the host application to send data to or receive data from the device, one data copy is needed between Linux user space and kernel space. Correct me if I am wrong.

Thanks & Regards

uday610 commented 1 year ago

  1. Correct.
  2. Incorrect. User space accesses the buffers, which are allocated by the driver, through a pointer, so there is no additional copy between user space and driver space.

qianyich commented 1 year ago

Something is still confusing to me. Does the data transfer completely bypass the kernel?

I just want to figure out what actually happens when an OpenCL Buffer object or an xrt::bo object is created. Based on what you said, those buffers are allocated by the driver through a pointer in the kernel, and they essentially reside in user space. So for any data transfer between host and device, data is copied directly between host user space and device space. Are those memory buffers visible to the kernel? One way I know to achieve such zero-copy is to share memory between user space and kernel space: malloc a buffer in user space, then update the kernel page table so that the kernel can access that buffer. Is that what happens here?

What actually happens under the hood when we create a device buffer here? I really appreciate your help!

Thanks!

uday610 commented 1 year ago

When we ask XRT to create a BO, the driver (kernel space) allocates memory both in host system memory and in device memory. A bo::sync operation then performs DMA over PCIe to synchronize the host memory and the device memory. After creating the BO, user space calls the map function (bo::map) to get a pointer through which it accesses the data (for reading or writing).
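The map step described above follows the general idea of mapping the same physical pages into more than one view. The sketch below shows that idea with plain POSIX `mmap`. It is an analogy only: a temporary file stands in for the driver-allocated host pages, `shared_mapping_demo` is a hypothetical helper name, and nothing here is XRT code or a description of how the XRT driver is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Maps the same backing pages twice and checks that data written through
// one mapping is visible through the other without any read()/write() copy.
// Returns true when both views observe the same bytes.
bool shared_mapping_demo() {
    const std::size_t size = 4096;

    // Stand-in for the memory owner: an anonymous temporary file.
    FILE* backing = tmpfile();
    if (backing == nullptr) return false;
    const int fd = fileno(backing);
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        fclose(backing);
        return false;
    }

    // First view plays the role of the owner's view of the buffer,
    // second view plays the role of the pointer handed to the application.
    void* owner_view = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    void* app_view = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

    bool ok = false;
    if (owner_view != MAP_FAILED && app_view != MAP_FAILED) {
        const char message[] = "input data";
        // The application fills the buffer through its own pointer...
        std::memcpy(app_view, message, sizeof message);
        // ...and the other view sees the same bytes at a different address.
        ok = owner_view != app_view &&
             std::memcmp(owner_view, message, sizeof message) == 0;
    }

    if (owner_view != MAP_FAILED) munmap(owner_view, size);
    if (app_view != MAP_FAILED) munmap(app_view, size);
    fclose(backing);
    return ok;
}
```

The two virtual addresses differ, yet both refer to the same pages. This is why a mapped pointer can avoid an extra copy between the party that allocated the memory and the application that fills it.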

qianyich commented 1 year ago

Let me back up a bit. When you mention host system memory, do you mean a shared memory region that is visible to both user space and kernel space? What I do not understand is where the driver allocates the memory: in Linux kernel space or in user space? (This matters because, to the best of my knowledge, a user-space application cannot directly access memory in kernel space unless the region is shared between user and kernel space, and an extra data copy would degrade performance.)

The second question: is the memory allocated for a BO shared memory?