Closed hecmay closed 4 years ago
This is an excellent starting point.
What does .op mean? Also, do we have to put external libs under hclib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for xilinx and further separate the HLS and RTL IPs.
> This is an excellent starting point.
> What does .op mean? Also, do we have to put external libs under hclib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for xilinx and further separate the HLS and RTL IPs.
`.op` means operator. The `hlib.op` module includes many common operations (e.g. `exp` or NN layers). I put the external IP APIs at the same level for regularity and consistency. For now, all of the external libs are under the `hlib` folder.
Each IP core will be marked with a specific attribute indicating its target FPGA and level of abstraction. I will also add another IR pass to support automatic data type transformation for the external IP calls (e.g. transforming a tensor to `hls::stream<ap_axiu<>>`).
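To illustrate that conversion, here is a behavioral model in plain Python (the function name `to_axi_stream` and the packet layout are illustrative sketches, not HeteroCL API): it mimics `ap_axiu` packets, which carry a data word plus a TLAST flag marking the last transfer of the stream.

```python
# Hypothetical behavioral model of the tensor-to-stream conversion the
# IR pass would generate. Each packet mimics ap_axiu: a data word plus
# a 'last' (TLAST) side-channel bit set on the final element.

def to_axi_stream(tensor):
    """Flatten a tensor (nested lists) into AXI-stream style packets."""
    flat = []
    def walk(x):
        if isinstance(x, list):
            for v in x:
                walk(v)
        else:
            flat.append(x)
    walk(tensor)
    return [{"data": v, "last": i == len(flat) - 1}
            for i, v in enumerate(flat)]

packets = to_axi_stream([[1, 2], [3, 4]])
print([p["data"] for p in packets])   # [1, 2, 3, 4]
print(packets[-1]["last"])            # True
```

The real pass would emit the equivalent HLS C++ loop; this sketch only captures the data layout it has to produce.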
New features introduced in this PR:
A new `ExternModule` IR node is introduced in this PR. This IR node wraps all statements running on a specific device (e.g. an SSD or another node in the cluster). The new code generator gives us more flexibility to support different devices with various requirements.

```python
s.to(tensorA, target.host.Flash)
s.to(tensorB, target.HBM)
```
Integration granularity of the external RTL IPs.
Ideally we want to integrate all the RTL IPs as blackboxes into our kernel program, where we can simply call the RTL IP as a sub-function, and the EDA tool will replace the function call with the user-provided RTL code.
```python
def kernel(image):
    out1 = hlib.op.extern.rtl.image_filter(image)
    out2 = hlib.op.extern.rtl.refine(out1)
    return out2

s.to(image, target.xcel)
s.to(kernel.out2, target.host)
```
However, to integrate the RTL IPs into an HLS C program, we need an interface specification file like https://github.com/Xilinx/HLS-Tiny-Tutorials/blob/master/misc_rtl_as_blackbox/rtl_model.json, which is oftentimes available from neither the users nor HeteroCL.
Can you fix the tests?
Please also replace your first post with your documentation so that the users do not need to scroll down to see it.
> Please also replace your first post with your documentation so that the users do not need to scroll down to see it.
Moved the tutorial to the top. Will fix the test now.
In this proposal we use HBM as an example. The channel or bank allocation for DDR and PLRAM fits well with the same interface proposed here.
The assignment of HBM channels comes along with compute unit (CU) replication. We should assign a different channel to each argument of each CU duplicate to maximize the bandwidth. Here is the proposed interface:
1) We can specify the kernel number (i.e. how many CUs to duplicate) in the data movement API with the `splitting_factor` option. In this case, multiple CU duplicates are created; the inputs are split evenly and assigned to different HBM channels (if the total number is greater than 32, some arguments will share an HBM channel).
2) We can split the input tensors along a single dimension using the `splitting_dim` option. In this case, we reshape the input tensors and split them along a certain dimension. In this example, we split the input tensor along the 0-th dimension, and 16 CU duplicates are generated accordingly.
```python
A = hcl.placeholder(in_shape, name="A")
B = hcl.placeholder(in_shape, name="B")

def kernel(...):
    # algorithm...

# create a custom platform
config = {
    "host": hcl.device.cpu("intel", "e5"),
    "xcel": {
        hcl.device.fpga("xilinx", "xcvu19p"),
        hcl.device.gpu("nvidia", "gtx-1080")
    }
}
p = hcl.platform.custom(config)

# case 1. move tensors to HBM with a splitting factor: the input tensors
# are split into multiple pieces and each piece is assigned to a separate CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

# case 2. assign the channel explicitly with a single CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

# case 3. reshape and split along a certain dimension
s.reshape([A, B], (2, 16))
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)
```
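To make the two splitting options concrete, here is a pure-Python sketch (not HeteroCL API; the names `assign_channels` and `split_dim0` are hypothetical) modeling how buffers could map onto the 32 HBM channels and how a tensor is split along dimension 0:

```python
# Behavioral sketch of the proposed channel-assignment rules.
# HBM exposes 32 channels, so assignments wrap around once the total
# number of buffers exceeds 32 (case 1 in the proposal).

NUM_HBM_CHANNELS = 32

def assign_channels(num_args, splitting_factor):
    """Round-robin HBM channels over (arguments x CU duplicates)."""
    total = num_args * splitting_factor   # one buffer per arg per CU copy
    return [i % NUM_HBM_CHANNELS for i in range(total)]

def split_dim0(tensor, pieces):
    """Split a tensor (list of rows) along dimension 0, one piece per CU."""
    chunk = len(tensor) // pieces
    return [tensor[i * chunk:(i + 1) * chunk] for i in range(pieces)]

# case 1: 2 args with splitting_factor=3 -> 6 buffers on distinct channels
print(assign_channels(2, 3))          # [0, 1, 2, 3, 4, 5]
# more than 32 buffers wrap onto shared channels
print(assign_channels(2, 20)[32:34])  # [0, 1]
# case 3: a (2, 16) tensor split along dim 0 into two CU-sized pieces
rows = [[i] * 16 for i in range(2)]
print([len(p) for p in split_dim0(rows, 2)])  # [1, 1]
```

This only models the mapping policy; the actual channel binding would be emitted by the code generator into the Vitis connectivity options.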
This is a good starting point. As always, we need to streamline the terms. Does a bank correspond to a virtual channel? Also, I suggest we use `bank[0]` instead of `bank0`.
> case 1. move tensors to HBM with a splitting factor: the input tensors are split into multiple pieces and each piece is assigned to a separate CU
> `A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)`
I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.
> `s.reshape([A, B], (2, 16))`
> `A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)`
Similar to my previous comment, we shall cascade .to() with another reshape/partition primitive. It's really important not to entangle multiple optimizations in one primitive.
> case 1. move tensors to HBM with a splitting factor: the input tensors are split into multiple pieces and each piece is assigned to a separate CU
> `A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)`
>
> I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.
We do not have such a kernel here to apply the `parallel` primitive to; that's why I used this entangled approach as a workaround. All stages that depend on the tensors moved to the device form a kernel, as shown in the example here. If we move tensors `A` and `B` to the device, and move tensor `ret` back to the host, then the combination of all stages in the middle (i.e. stage 1 to k) is considered the kernel in this program.
```python
A = hcl.placeholder((10,))
B = hcl.placeholder((10,))
# stage 1 to stage k
# .... compute something
ret = hcl.compute((10,), lambda *args: ...)
```
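The rule above (every stage transitively fed by the moved-in tensors that also feeds the moved-out tensor belongs to the kernel) can be sketched in plain Python; the stage graph and the function name `stages_between` are hypothetical, not HeteroCL API:

```python
# Sketch of inferring the kernel boundary from data movement:
# intersect the forward closure of the device inputs with the
# backward closure of the tensor moved back to host.

def stages_between(deps, inputs, output):
    """deps maps each stage to the tensors/stages it reads."""
    # forward closure from the tensors moved to the device
    fwd = set()
    changed = True
    while changed:
        changed = False
        for s, reads in deps.items():
            if s not in fwd and any(r in inputs or r in fwd for r in reads):
                fwd.add(s)
                changed = True
    # backward closure from the tensor moved back to host
    bwd = {output}
    changed = True
    while changed:
        changed = False
        for s in list(bwd):
            for r in deps.get(s, []):
                if r in deps and r not in bwd:
                    bwd.add(r)
                    changed = True
    return fwd & bwd

# hypothetical stage graph: s1 -> s2 -> ret, plus a stage not feeding ret
deps = {"s1": ["A", "B"], "s2": ["s1"], "ret": ["s2"], "dbg": ["A"]}
print(sorted(stages_between(deps, {"A", "B"}, "ret")))  # ['ret', 's1', 's2']
```

Note that `dbg` reads `A` but does not feed `ret`, so it falls outside the kernel, which is exactly the ambiguity being discussed.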
I cannot find a clean and concise way to specify the range of the kernel. @seanlatias Do you have any suggestion?
> split into multiple pieces and each piece assigned to a separate CU
I thought the CU you're referring to here has to correspond to a compute kernel that needs to be duplicated? If not, why are we moving the tensor to the device?
The discussion for heterogeneous memory placement has been moved to #180.
In this PR, we enable support for integrating HLS and RTL IPs into HeteroCL. The external IPs are pre-defined functions in `hlib`, consisting of both a behavior-level functional description (used for LLVM JIT simulation) and IP information (e.g. interface ports, IP file directory). Take the vector-add RTL IP as an example: users call the pre-defined function in `hlib`. The IP integration happens in the code generation phase, where the code generator creates the corresponding Makefile and XML options to integrate the RTL / HLS IPs.
Tutorial on Adding an HLSC/OpenCL IP into HeteroCL:
This tutorial will walk you through the main steps to create, simulate, and deploy an HLS (i.e., HLSC or OpenCL) IP into HeteroCL. We will take the FFT (Fast Fourier Transform) as an example. How the FFT algorithm works is out of scope for this tutorial; please check this link if you are interested.
Create a behavior level function
The behavior-level function is the functionally equivalent HeteroCL code of the HLS IP to be integrated. This part is recommended if you want to verify that the IP works correctly along with the other components in the program using HeteroCL LLVM JIT simulation. The HeteroCL version of FFT is available in the master branch.
For the HeteroCL implementation of the algorithm, you can either create and return tensors, or update the passed-in tensors. The algorithm part should be wrapped in a HeteroCL super stage, `hcl.Stage("ExternModule")` in this example. If you do not want to run any SW simulation, simply creating some dummy HeteroCL statements under the super stage should also work (not recommended).

Configure the inputs, outputs, and core logic for the software IP module
To configure the IP and let HeteroCL integrate your IP, you need to pass the IP information into the `create_top_module` function provided by HeteroCL, as shown in the snippet above. We use this function to create a top-level module (which will be mapped to, e.g., an OpenCL kernel function in the code generation stage) for the soft IP. We also support integrating the IP within a top module.

The `dicts` argument is the core of the HLS IP integration process: it allows users to directly insert raw HLS statements into the HeteroCL program. Since most advanced C/C++ features cannot be expressed in HeteroCL, we leave the IP configuration to users to keep the flexibility of IP integration. Users can insert HLS code in the header, and right before and after the IP function. Notice that the inputs and outputs arguments must be tensors; if users want to call an IP function with other data types, like the complex data type in this example, the conversion logic must be implemented using `dicts["ip_func"]`.
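As an illustration, here is a hedged sketch of what such a `dicts` configuration might look like for the FFT example. Only the `"ip_func"` key and the header/before/after insertion points are named in the text; the C identifiers (`fft_ip`, `cfloat`, `LEN`) and the exact string layout are hypothetical placeholders, not the actual `extern.py` schema.

```python
# Hypothetical dicts configuration: raw HLS C/C++ fragments keyed by
# insertion point. "header" goes at file scope; "ip_func" wraps the IP
# call with the tensor <-> complex-type conversion logic.
dicts = {
    "header": """
#include "hls_stream.h"
#include <complex>
typedef std::complex<float> cfloat;
""",
    "ip_func": """
cfloat buf[LEN];
for (int i = 0; i < LEN; i++) buf[i] = cfloat(x_real[i], x_imag[i]);
fft_ip(buf);  // the external HLS IP call
for (int i = 0; i < LEN; i++) {
    x_real[i] = buf[i].real();
    x_imag[i] = buf[i].imag();
}
""",
}

print(sorted(dicts))  # ['header', 'ip_func']
```

The code generator would splice these strings verbatim into the emitted HLS source, which is why the conversion between tensors and `std::complex` lives inside `"ip_func"`.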
In a later release, we need to add an automatic detection algorithm to generate the data type conversion logic.

Data movement with HLS IP
There are three IP types (RTL / HLS / Host). An IP core of type RTL or HLS must be moved to the device scope using `.to` (as shown in the example in the snippet below). The IP core is the minimum placement unit from the view of the data placement API; namely, you cannot move any tensors inside an IP core back and forth between device and host.

The code for this example is available here: https://github.com/Hecmay/heterocl/blob/extern/hlib/python/hlib/op/extern.py#L202