UCLA-VAST / tapa

TAPA is a dataflow HLS framework that features fast compilation and an expressive programming model, and generates high-frequency FPGA accelerators.
https://tapa.rtfd.io
MIT License

Inquire about host kernel invoking #128

Open ueqri opened 1 year ago

ueqri commented 1 year ago

Hi, in our experience with TAPA, tapa::invoke on the host side is quite time-consuming when we want to re-run a kernel multiple times, because it involves two unnecessary steps: (1) re-programming the bitstream onto the FPGA, and (2) re-transferring data back and forth.

Specifically, for this workflow in host program:

  1. Set kernel arguments
  2. Write scalar and buffer to device
  3. Execute kernel
  4. Read partial results (not full buffer) from device
  5. Re-set kernel arguments
  6. Write only scalar to device
  7. Re-execute kernel (w/ different arguments)
  8. Repeat from step 4 until the exit condition is true
  9. Read all outputs (full buffer) from device

We found that the TAPA library provides no host-side API for such a workflow, unless we (1) use the fpga-runtime library directly, or (2) use the XRT/OpenCL APIs manually without the above wrappers (which is tricky and may interfere with other parts such as TAPA simulation).

So I was wondering whether the above workflow can be implemented with TAPA today, or whether there is any plan to support fine-grained kernel invocation in the future. Thank you!

(PS: besides implementing this workflow on the host side, we have considered an on-board scheduler, but we believe it may be detrimental to frequency and area since our kernel is already large.)

Blaok commented 1 year ago

I would recommend using the FPGA runtime library unless by "read partial results" you mean reading only part of a single buffer, which can be achieved only with the XRT/OpenCL APIs. You can use fpga::Instance::SuspendBuf to skip a buffer during WriteToDevice and ReadFromDevice. To re-enable a buffer for data transfer, simply call SetArg again.

Example:

// This is the kernel function.
void Kernel(int scalar, tapa::mmap<float> buf1, tapa::mmap<float> buf2);

// Make sure to 4k-align the host memory to avoid additional copy.
template <typename T>                                              
using aligned_vector = std::vector<T, tapa::aligned_allocator<T>>; 

...
int main() {

  // Load the bitstream and acquire the FPGA resource.
  fpga::Instance instance("/path/to/bitstream");

  // Prepare the arguments.
  int scalar_arg = 42;  // Make sure the scalar arguments have the correct size.
  aligned_vector<float> buf1_vec, buf2_vec;
  // Note that fpga::ReadOnly corresponds to tapa::write_only_mmap and
  // fpga::WriteOnly corresponds to tapa::read_only_mmap. This is because
  // the FPGA runtime library is host-centric but TAPA is kernel-centric.
  auto buf1_arg = fpga::ReadWrite(buf1_vec.data(), buf1_vec.size());
  auto buf2_arg = fpga::ReadWrite(buf2_vec.data(), buf2_vec.size());

  // Set kernel arguments.
  instance.SetArgs(scalar_arg, buf1_arg, buf2_arg);
  /* This is equivalent to the following:
   *  instance.SetArg(0, scalar_arg);
   *  instance.SetArg(1, buf1_arg);
   *  instance.SetArg(2, buf2_arg);
   */

  // Write both buffers to device
  instance.WriteToDevice();

  for (;;) {
    // Execute kernel
    instance.Exec();

    // Wait until previous operations finish
    instance.Finish();

    // Read only buf1 from device (skip buf2)
    instance.SuspendBuf(2);
    instance.ReadFromDevice();

    if (...) break;
    // ...

    // Re-set kernel arguments (scalar arguments do not need explicit WriteToDevice)
    instance.SetArg(0, scalar_arg);

    // Re-execute kernel (w/ different arguments) in the next iteration
  }

  // Read all outputs (full buffer) from device
  instance.SetArg(2, buf2_arg);
  instance.ReadFromDevice();
  instance.Finish();

  // ...
}
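
The 4k-alignment remark in the example above can be illustrated with a stand-in allocator. This is only a sketch: PageAlignedAllocator, IsPageAligned, and kPageSize are hypothetical names, not the implementation of tapa::aligned_allocator, and the code assumes a C++17 compiler with aligned operator new.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Assumed page size; matches the "4k-align" comment in the example.
constexpr std::size_t kPageSize = 4096;

// Returns true if p sits on a 4 KiB boundary.
inline bool IsPageAligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}

// Hypothetical stand-in for tapa::aligned_allocator (not the library's code).
template <typename T>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    // Guard against overflow in n * sizeof(T).
    if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(kPageSize)));
  }

  void deallocate(T* p, std::size_t) noexcept {
    assert(IsPageAligned(p));
    ::operator delete(p, std::align_val_t(kPageSize));
  }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&,
                const PageAlignedAllocator<U>&) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&,
                const PageAlignedAllocator<U>&) noexcept {
  return false;
}
```

A std::vector<float, PageAlignedAllocator<float>> then hands out a data() pointer for which IsPageAligned returns true, which is the property the example relies on to avoid the extra copy.
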

ueqri commented 1 year ago

Thanks for your kind reply and example! It really helps our development.

BTW, is there any way to use the TAPA-style reinterpret method, like tapa::read_only_mmap<T0>(buf).reinterpret<T1>(), in the fpga-runtime API? If so, could you point me to the best practice? I didn't find any hints in the library codebase. Thank you!

Blaok commented 1 year ago

It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (within a week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1).
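
The two assumptions above can be checked before casting. The following is a minimal sketch, not part of FRT or TAPA: ReinterpretChecked and Vec4 are hypothetical names, and "properly aligned" is interpreted here as alignment to alignof(T1), which is an assumption (a runtime may want stricter alignment, such as 4k).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical kernel-side element type: four packed floats.
struct Vec4 {
  float lane[4];
};

// Hypothetical helper: views `count` elements of T0 as elements of T1.
// Returns {pointer, element count in T1}, or {nullptr, 0} if either
// precondition fails. The returned pointer is meant to be handed to a
// runtime for data transfer, not dereferenced on the host.
template <typename T1, typename T0>
std::pair<T1*, std::size_t> ReinterpretChecked(T0* data, std::size_t count) {
  const std::size_t bytes = count * sizeof(T0);
  // 1) The buffer must be aligned for T1 (assumed meaning of "properly aligned").
  const bool aligned =
      reinterpret_cast<std::uintptr_t>(data) % alignof(T1) == 0;
  // 2) The byte size must be a whole number of T1 elements.
  const bool divisible = bytes % sizeof(T1) == 0;
  if (!aligned || !divisible) return {nullptr, 0};
  return {reinterpret_cast<T1*>(data), bytes / sizeof(T1)};
}
```

For example, ReinterpretChecked<Vec4>(buf.data(), buf.size()) on a std::vector<float> of 8 elements yields a count of 2, while a count of 7 floats is rejected because 28 bytes is not a multiple of sizeof(Vec4).
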

Blaok commented 1 year ago

@ueqri FYI I added a Reinterpret method. Please upgrade to libfrt-dev 0.0.20221212.1 to use it.

linghaosong commented 10 months ago

Quoting Blaok: "It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (within a week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1)."

The official Xilinx examples don't make this clear, which is a real trap. Reading this made me realize I had a bug in my earlier code...

linghaosong commented 10 months ago

BTW, for anyone who has a similar issue and happens to land here, this is how to use XCL to cast pointers between host and kernel types.

You need to do cl::Buffer(context, CL_MEM_XXXX | CL_XXXXX, arr.size() * sizeof(T_data_type_HOST), reinterpret_cast<T_data_type_KERNEL*>(arr.data()), &err);

Be careful with the size you pass to cl::Buffer: it is the size of arr in bytes.
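
To make the size pitfall concrete, here is a small sketch of the arithmetic without any OpenCL calls. BufferBytes, KernelElements, and KernelWord are hypothetical names used only for illustration, and the numbers below assume a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 64-byte kernel-side element (16 packed floats).
struct KernelWord {
  float lane[16];
};

// Size of the host array in bytes: the value to pass as the buffer size.
template <typename HostT>
std::size_t BufferBytes(const std::vector<HostT>& arr) {
  return arr.size() * sizeof(HostT);
}

// Number of kernel-side elements that the same bytes hold.
template <typename KernelT, typename HostT>
std::size_t KernelElements(const std::vector<HostT>& arr) {
  const std::size_t bytes = BufferBytes(arr);
  // The host bytes must form a whole number of kernel elements.
  assert(bytes % sizeof(KernelT) == 0);
  return bytes / sizeof(KernelT);
}
```

With a std::vector<float> of 32 elements, BufferBytes gives 128 bytes, which holds 2 KernelWord elements. Multiplying the host element count by the kernel type's size instead (32 * 64 = 2048 bytes) would describe a buffer far larger than the memory that arr actually owns.
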