cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[Backend][hlib][v0.3] External IPs Integration Support for HeteroCL #170

Closed hecmay closed 4 years ago

hecmay commented 4 years ago

In this PR, we enable support for integrating HLS and RTL IPs into HeteroCL. The external IPs are pre-defined functions in hlib, consisting of both a behavior-level functional description (used for LLVM JIT simulation) and IP information (e.g., interface ports, IP file directory). Take the vector-add RTL IP as an example: users only need to call the pre-defined function in hlib.


    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)

The IP integration will happen in the code generation phase, where the code generator creates the corresponding Makefile and XML options to integrate the RTL / HLS IPs.
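
For reference, here is a minimal sketch of exercising the vector-add example above through the default LLVM JIT flow, which simulates the behavior-level description of the IP. The vector length, the random test data, and the output-argument calling convention are assumptions; treat this as a sketch rather than the exact test in the PR.

    import numpy as np
    import heterocl as hcl
    import hlib

    in_shape = (1024,)  # assumed vector length
    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)
    f = hcl.build(s)  # default LLVM JIT target; FPGA targets trigger the IP code generation

    hcl_A = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_B = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_C = hcl.asarray(np.zeros(in_shape))
    f(hcl_A, hcl_B, hcl_C)
    print(hcl_C.asnumpy()[:8])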

Tutorial on Adding an HLSC/OpenCL IP into HeteroCL:

This tutorial walks you through the main steps to create, simulate, and deploy an HLS (i.e., HLSC or OpenCL) IP in HeteroCL. We take FFT (Fast Fourier Transform) as an example. How the FFT algorithm works is out of scope for this tutorial; please check this link if you are interested.

Create a behavior level function

The behavior-level function is the functionally equivalent HeteroCL code of the HLS IP to be integrated. This part is recommended if you want to verify that the IP works correctly along with the other components in the program using HeteroCL LLVM JIT simulation. The HeteroCL version of FFT is available in the master branch.

For the HeteroCL implementation of the algorithm, you can either create and return tensors, or update the passed-in tensors. The algorithm part should be wrapped in a HeteroCL super stage, hcl.Stage("ExternModule") in this example. If you do not want to run any SW simulation, simply creating some dummy HeteroCL statements under the super stage also works (not recommended).

    import heterocl as hcl
    from hlib.op.extern import create_top_module

    def fft_module(X_real, X_imag):
        # step 1. create the behavior-level function for the soft IP
        with hcl.Stage("ExternModule") as Module:
            # implement the FFT logic with HeteroCL APIs (the full version is
            # in the master branch); it should produce the output tensors
            # F_real and F_imag, e.g.
            #   hcl.update(X_real, lambda *args: ...)
            #   F_real = hcl.compute((L,), lambda i: ...)
            #   F_imag = hcl.compute((L,), lambda i: ...)
            pass

        # step 2. configure the soft IP
        L = X_real.shape[0]  # FFT length
        dicts = dict()
        # IP function name
        dicts["name"] = "hls::fft<config>"
        # tensor inputs (name, dtype tuples) passed to the IP function
        tensors = [X_real, X_imag, F_real, F_imag]
        dicts["args"] = [(_.name, _.dtype) for _ in tensors]

        # IP function headers and calling convention
        dicts["header"] = """
    #include "hls_fft.h"
    #include <complex>
    struct config : hls::ip_fft::params_t {
      static const unsigned ordering_opt = hls::ip_fft::natural_order;
      static const unsigned config_width = 16; // FFT_CONFIG_WIDTH
    };
    typedef std::complex<ap_fixed<16,1>> fxpComplex;
    """

        # statements inserted before and after the IP function call
        dicts["ip_func"] = """
    hls::ip_fft::config_t<config> fft_config;
    hls::ip_fft::status_t<config> fft_status;
    fft_config.setDir(0);
    fft_config.setSch(0x2AB);
    complex<ap_fixed<16,1>> xn[{}];
    complex<ap_fixed<16,1>> xk[{}];
    for (int i = 0; i < {}; i++)
        xn[i] = fxpComplex({}[i], {}[i]);
    hls::fft<config>(xn, xk, &fft_config, &fft_status);
    for (int i = 0; i < {}; i++) {{
        {}[i] = xk[i].real();
        {}[i] = xk[i].imag();
    }}
    """.format(L, L, L, X_real.name, X_imag.name,
               L, F_real.name, F_imag.name)

        # dictionary specifying the header and pre-/post-function code
        create_top_module(Module, dicts, ip_type="hls")
        return F_real, F_imag  # output tensors produced in step 1

Configure the inputs, outputs, and core logic of the software IP module

To configure the IP and let HeteroCL integrate it, you pass the IP information into the create_top_module function provided by HeteroCL, as shown in the snippet above. We use this function to create a top-level module (which will be mapped to, e.g., an OpenCL kernel function in the code generation stage) for the soft IP. We also support integrating the IP within a top module.

The dicts argument is the core of the HLS IP integration process: it lets users directly insert raw HLS statements into the HeteroCL program. Since most advanced C/C++ features cannot be expressed with HeteroCL, we leave the IP configuration to users to keep the IP integration flexible. Users can insert HLS code into the header, as well as right before and after the IP function call.

Notice that the input and output arguments must be tensors; if users want to use an IP function with other data types, like the complex data type in the example, the conversion logic must be implemented in dicts["ip_func"]. In a later release, we plan to add automatic detection to generate the data type conversion logic.
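
For quick reference, below is a condensed view of the configuration dictionary assembled in the tutorial above. Only the keys that appear there are listed; the long HLS strings are abbreviated and the dtype strings are placeholders.

    # condensed sketch of the IP configuration dictionary (values abbreviated)
    dicts = {
        "name": "hls::fft<config>",      # IP function name as written in HLS code
        "args": [("X_real", "fixed16"),  # (tensor name, dtype) pairs for the IP call
                 ("X_imag", "fixed16"),
                 ("F_real", "fixed16"),
                 ("F_imag", "fixed16")],
        "header": "...",                 # raw HLS code inserted into the header
        "ip_func": "...",                # raw HLS statements around the IP function call
    }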

Data movement with HLS IP

There are three IP types (RTL / HLS / Host). An IP core of type RTL or HLS must be moved to the device scope using .to (as shown in the snippet below). The IP core is the minimum placement unit from the view of the data placement API; namely, you cannot move tensors inside an IP core back and forth between device and host.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def kernel(A, B):
        real, imag = fft_module(A, B)
        return hcl.compute((length,), lambda x:
            hcl.sqrt(real[x] * real[x] + imag[x] * imag[x]), name="abs")

    s = hcl.create_schedule([A, B], kernel)
    s.to([A, B], target.xcel)

The code for this example is available here: https://github.com/Hecmay/heterocl/blob/extern/hlib/python/hlib/op/extern.py#L202
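
As a companion to the example above, here is a hedged sketch of checking the behavior-level FFT through LLVM JIT simulation. It assumes fft_module has been completed with the actual HeteroCL FFT logic (see the master branch), and the NumPy line is only a reference under the assumed semantics (natural-order FFT of the complex input A + jB).

    import numpy as np
    import heterocl as hcl

    length = 1024                 # assumed FFT length
    hcl.init(hcl.Float())

    A = hcl.placeholder((length,), name="A")
    B = hcl.placeholder((length,), name="B")

    def kernel(A, B):
        real, imag = fft_module(A, B)
        return hcl.compute((length,), lambda x:
            hcl.sqrt(real[x] * real[x] + imag[x] * imag[x]), name="abs")

    s = hcl.create_schedule([A, B], kernel)
    f = hcl.build(s)              # default LLVM JIT target

    np_A = np.random.rand(length)
    np_B = np.random.rand(length)
    hcl_out = hcl.asarray(np.zeros(length))
    f(hcl.asarray(np_A), hcl.asarray(np_B), hcl_out)

    # compare against NumPy (assumed reference semantics)
    print(hcl_out.asnumpy()[:4])
    print(np.abs(np.fft.fft(np_A + 1j * np_B))[:4])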

zhangzhiru commented 4 years ago

This is an excellent starting point.

What does .op mean? Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need a separate lib for Xilinx, and to further separate the HLS and RTL IPs.

hecmay commented 4 years ago

This is an excellent starting point.

What does .op mean? Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need a separate lib for Xilinx, and to further separate the HLS and RTL IPs.

.op means operator. hlib.op includes many common operations (e.g., exp or NN layers). I put the external IP APIs at the same level for regularity and consistency. For now, all of the external libs are under the hlib folder.
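
For context, existing hlib.op operators are called in the same way as the external-IP wrappers; a small sketch (the exp operator under hlib.op.math is assumed to be available in your hlib build):

    import heterocl as hcl
    import hlib

    hcl.init(hcl.Float())
    A = hcl.placeholder((32,), name="A")

    def func(A):
        # built-in operator from hlib.op (exp assumed available), called the
        # same way as the external-IP wrappers under hlib.op.extern
        return hlib.op.math.exp(A)

    s = hcl.create_schedule([A], func)
    f = hcl.build(s)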

Each IP core will be marked with a specific attribute indicating its target FPGA and level of abstraction. I will also add another IR pass to support automatic data type transformation for the external IP calls (e.g., transforming a tensor into an hls::stream<ap_axiu<>>).

hecmay commented 4 years ago

New features introduced in this PR:

  1. Code generator for Tcl / Makefile generation: the integrated RTL IP is treated as a black box, so we need to add additional flags to the Makefile as well as extra Tcl scripts to specify the port interface of the IP.
  2. New IR node for device placement: a new ExternModule IR node is introduced in this PR. This IR node wraps all statements running on a specific device (e.g., an SSD or another node in the cluster). The new code generator gives us more flexibility to support different devices with various requirements, for example:

      s.to(tensorA, target.host.Flash)
      s.to(tensorB, target.HBM)

hecmay commented 4 years ago

Integration granularity of the external RTL IPs.

Ideally we want to integrate all the RTL IPs as black boxes into our kernel program, where we simply call the RTL IP as a sub-function and the EDA tool replaces the function call with the user-provided RTL code. For example:

    def kernel(image):
        out1 = hlib.op.extern.rtl.image_filter(image)
        out2 = hlib.op.extern.rtl.refine(out1)
        return out2

    s.to(image, target.xcel)
    s.to(kernel.out2, target.host)

However, to integrate RTL IPs into an HLSC program, we need an interface specification file like https://github.com/Xilinx/HLS-Tiny-Tutorials/blob/master/misc_rtl_as_blackbox/rtl_model.json, which oftentimes neither the users nor HeteroCL can provide.

seanlatias commented 4 years ago

Can you fix the tests?

seanlatias commented 4 years ago

Please also replace your first post with your documentation so that users do not need to scroll down to see it.

hecmay commented 4 years ago

Please also replace your first post with your documentation so that users do not need to scroll down to see it.

Moved the tutorial to the top. Will fix the test now.

hecmay commented 4 years ago

Data Movement in Heterogeneous Memory Systems

In this proposal we use HBM as an example. The channel or bank allocation for DDR and PLRAM fits well with the same interface proposed here.

The assignment of HBM channels comes along with compute unit (CU) replication. We are supposed to assign different channels to the arguments of each CU duplicate to maximize bandwidth. Here is the proposed interface:

1) We can specify the kernel number (i.e., how many CUs to duplicate) in the data movement API with the splitting_factor option. In this case, multiple CU duplicates are created, and the inputs are split evenly and assigned to different HBM channels (if the total number is greater than 32, some arguments will be assigned to the same HBM channel).

2) We can split the input tensors along a single dimension using the splitting_dim option. In this case, we reshape the input tensors and split them along a certain dimension. In the example below, we split the input tensor along the 0th dimension, and 16 CU duplicates are generated accordingly.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def kernel(...):
        # algorithm ...

    # create a custom platform
    config = {
        "host": hcl.device.cpu("intel", "e5"),
        "xcel": {
            hcl.device.fpga("xilinx", "xcvu19p"),
            hcl.device.gpu("nvidia", "gtx-1080")
        }
    }
    p = hcl.platform.custom(config)
    s = hcl.create_schedule([A, B], kernel)

    # case 1. move tensors to HBM with a splitting factor: the input tensors are
    # split into multiple pieces and each piece is assigned to a separate CU
    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

    # case 2. assign the channel explicitly with a single CU
    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

    # case 3. reshape and split along a certain dimension
    s.reshape([A, B], (2, 16))
    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

zhangzhiru commented 4 years ago

    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

This is a good starting point. As always, we need to streamline the terms. Does bank correspond to a virtual channel? Also, I suggest we use bank[0] instead of bank0.

zhangzhiru commented 4 years ago

case 1. move tensors to HBM with a splitting factor: the input tensors are split into multiple pieces and each piece is assigned to a separate CU

    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

zhangzhiru commented 4 years ago

    s.reshape([A, B], (2, 16))
    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

Similar to my previous comment, we should cascade .to() with a separate reshape/partition primitive. It's really important not to entangle multiple optimizations in one primitive.

hecmay commented 4 years ago

case 1. move tensors to HBM with a splitting factor: the input tensors are split into multiple pieces and each piece is assigned to a separate CU

    A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

We do not have such a kernel here to apply the parallel primitive to; that's why I used this entangled approach as a workaround. All stages that depend on the tensors moved to the device form a kernel, as shown in the example below: if we move tensors A and B to the device and move tensor ret back to the host, then the combination of all stages in the middle (i.e., stages 1 to k) is considered the kernel in this program.

    A = hcl.placeholder((10,))
    B = hcl.placeholder((10,))

    # stage 1 to stage k
    # ... compute something

    ret = hcl.compute((10,), lambda *args: ...)
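
To make the inferred boundary concrete, here is a hedged, self-contained sketch (the shapes, stage bodies, and the zc706 platform are assumptions, and the HBM splitting options are omitted): moving A and B to the device and ret back to the host makes everything in between one kernel.

    import heterocl as hcl

    A = hcl.placeholder((10,), name="A")
    B = hcl.placeholder((10,), name="B")

    def algo(A, B):
        stage_1 = hcl.compute((10,), lambda i: A[i] + B[i], name="stage_1")
        # ... stage_2 to stage_k ...
        ret = hcl.compute((10,), lambda i: stage_1[i] * 2, name="ret")
        return ret

    target = hcl.platform.zc706
    s = hcl.create_schedule([A, B], algo)
    s.to([A, B], target.xcel)    # kernel inputs: start of the device scope
    s.to(algo.ret, target.host)  # kernel output: stages in between form the kernel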

I cannot find a clean and concise way to specify the range of the kernel. @seanlatias Do you have any suggestions?

zhangzhiru commented 4 years ago

split into multiple pieces and each piece assigned to a separate CU

I thought the CU you're referring to here has to correspond to a compute kernel that needs to be duplicated? If not, why are we moving the tensor to the device?

hecmay commented 4 years ago

The discussion for heterogeneous memory placement has been moved to #180.