cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[WIP] Backend OpenCL for SDAccel #152

Closed hecmay closed 4 years ago

hecmay commented 4 years ago

Major changes in this PR:

The list of changes looks a bit messy now. I will fix the remaining bugs and separate these changes into 7 different PRs.

  1. Host & device code splitting: this IR pass groups statements into blocks based on their device scope information. All statements residing in an xcel scope are reorganized into a kernel function, which is then offloaded to the accelerator device.
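The grouping step above can be sketched in plain Python. This is a hypothetical toy model of the pass, not HeteroCL's actual IR code: each statement carries a device scope, and consecutive xcel-scope statements are collected into one kernel block.

```python
# Toy model of host/device splitting (illustrative only, not HeteroCL IR code):
# consecutive statements with the same device scope are merged into one block.
def split_host_device(stmts):
    """Group (scope, stmt) pairs into host blocks and xcel kernel blocks."""
    blocks = []
    for scope, stmt in stmts:
        if blocks and blocks[-1][0] == scope:
            blocks[-1][1].append(stmt)   # extend the current block
        else:
            blocks.append((scope, [stmt]))  # open a new block
    return blocks

program = [
    ("host", "read input"),
    ("xcel", "compute C = A + B"),
    ("xcel", "compute D = C * 2"),
    ("host", "write output"),
]
# The two xcel statements end up in a single block, which would become
# one kernel function offloaded to the device.
print(split_host_device(program))
```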
  2. Unified simulation flow with a clean interface: the tool mode can be easily modified in the Python frontend, as shown in the following snippet. Calling the compiled HeteroCL function from Python compiles and generates the host and accelerator binaries under the local path. Users can then invoke the hardware function directly from the Python frontend (argument values are transmitted from Python to the host binary through shared memory).
tool = hcl.tool.sdaccel
tool.mode = "sw_emu" # "hw" or "hw_emu"
target = hcl.platform.aws_f1(tool)
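The shared-memory handoff mentioned above can be illustrated with the Python standard library. This is a minimal conceptual sketch of passing an argument buffer to a separate process by name; the names and layout here are illustrative, not HeteroCL's actual implementation.

```python
# Sketch of a shared-memory argument handoff (conceptual; not HeteroCL code).
from multiprocessing import shared_memory
import struct

# "Python frontend" side: pack argument values into a named shared segment.
args = [1.0, 2.5, 4.0]
shm = shared_memory.SharedMemory(create=True, size=8 * len(args))
struct.pack_into(f"{len(args)}d", shm.buf, 0, *args)

# "Host binary" side (simulated in-process): attach by name and read back.
reader = shared_memory.SharedMemory(name=shm.name)
values = list(struct.unpack_from(f"{len(args)}d", reader.buf, 0))
print(values)  # [1.0, 2.5, 4.0]

reader.close()
shm.close()
shm.unlink()
```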
  3. Support for declarative programming in hcl.def_: the statements placed under hcl.def_ are grouped into a super stage in HeteroCL, which is later used to create a kernel node in the IR tree. Users can now easily use hcl.compute or hcl.update inside a kernel function definition. Example:

    @hcl.def_([A.shape, B.shape])
    def kernel(A, B):
        hcl.update(B, lambda *args: A[args] + 1)
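The semantics of the update above can be stated in plain Python: every index of B is overwritten by a function of the corresponding element of A. This is a conceptual analogue only, not HeteroCL code.

```python
# Plain-Python analogue of hcl.update(B, lambda *args: A[args] + 1)
# over a 1-D buffer (conceptual illustration, not HeteroCL code).
def update(dst, src, fcompute):
    """Overwrite every element of dst using fcompute(src, index)."""
    for i in range(len(dst)):
        dst[i] = fcompute(src, i)

A = [0, 1, 2, 3]
B = [0] * 4
update(B, A, lambda a, i: a[i] + 1)
print(B)  # [1, 2, 3, 4]
```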
  4. OpenCL code generator with streaming support: for now we support streaming from host to global memory, as well as streaming between kernels using pipes. Note that streaming between sub-functions within a kernel function is not allowed in SDAccel. Examples as follows:

        s.partition(A, factor=2)                                    
        s.to(A, target.xcel, mode="burst")                               
        s.to(B, target.host)                                
        s.to(A, s[kernel_1], s[kernel_2])
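The kernel-to-kernel streaming in `s.to(A, s[kernel_1], s[kernel_2])` can be modeled with two threads and a bounded queue standing in for the OpenCL pipe. This is purely illustrative; the actual generated code uses OpenCL pipes, not Python threads.

```python
# Toy model of streaming between two kernels through a pipe
# (queue.Queue stands in for a fixed-depth OpenCL pipe; illustrative only).
import queue
import threading

pipe = queue.Queue(maxsize=4)  # bounded, like a FIFO of fixed depth
results = []

def kernel_1():
    """Producer stage: writes squared values into the pipe."""
    for x in range(5):
        pipe.put(x * x)
    pipe.put(None)  # end-of-stream marker

def kernel_2():
    """Consumer stage: reads from the pipe and post-processes."""
    while True:
        x = pipe.get()
        if x is None:
            break
        results.append(x + 1)

t1 = threading.Thread(target=kernel_1)
t2 = threading.Thread(target=kernel_2)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [1, 2, 5, 10, 17]
```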
  5. Auto-tuning integration (to be split into another PR): auto-tuning can be used to perform auto-scheduling on a HeteroCL program, but performance is poor with the generic heuristics. For now, the auto-tuning API in HeteroCL is only used for quantization-scheme tuning. The API is shown below; after calling it, control is handed over to uptune. I will separate this part into another PR later.

    def kernel(inputs):
        # ...

    s = hcl.create_schedule([inputs], kernel)
    hcl.tune(s, kernel, target)
  6. Auto-scheduling with an analytical model (to be split into another PR): auto-scheduling in HeteroCL performs analysis on adjacent compute stages in the program. The analytical model is still very weak: it can only check the reusability of a given variable and apply reuse_at and compute_at schedules to stages with optimization opportunities. The API is similar to hcl.tune():

s = hcl.create_schedule([inputs], kernel)
hcl.autosch(s, target)
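The reusability check described above can be sketched as follows. Assuming a stage reads a sliding window of a buffer (e.g. `A[i]`, `A[i+1]`, `A[i+2]`), counting the offsets shared between consecutive iterations indicates whether a reuse buffer (`reuse_at`) would pay off. This is a toy sketch of the idea, not HeteroCL's actual analysis.

```python
# Toy reuse check (illustrative only, not HeteroCL's analytical model):
# count how many window elements are shared between iteration i and i+stride.
def reused_elements(offsets, stride=1):
    """Return the number of accessed offsets reused by the next iteration."""
    cur = set(offsets)
    nxt = {o + stride for o in offsets}
    return len(cur & nxt)

window = [0, 1, 2]  # stage reads A[i], A[i+1], A[i+2]
# 2 of 3 elements overlap with the next iteration -> reuse_at is worthwhile.
print(reused_elements(window))  # 2
```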
  7. New hlib library and test cases (to be split into another PR): to make my life easier, I created many hlib functions along with their test cases. Here is a list of the newly added functions and examples of how to use them.
# hlib/nn.py
# 1. inter-channel local response norm
# 2. batch norm 

# hlib/function.py (create a HeteroCL module)
# 1. sorting function  
# 2. argmax function
# 3. conv2d function  

# hlib/math.py (create a new stage)
# 1. sorting function 
# 2. argmax function 
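As a reference for the argmax helper listed above, its semantics in plain Python are as follows (conceptual only; the actual hlib version builds a HeteroCL module/stage rather than running in Python).

```python
# Plain-Python reference semantics for an argmax helper
# (conceptual; not the hlib implementation itself).
def argmax(xs):
    """Return the index of the first maximum element."""
    return max(range(len(xs)), key=xs.__getitem__)

print(argmax([3, 7, 1, 7, 2]))  # 1: first index holding the maximum value 7
```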
zhangzhiru commented 4 years ago

Have we also checked in the unit test cases for the OpenCL support? If not, we need to reopen this issue until the test is completed.

seanlatias commented 4 years ago

This PR was targeting the wrong master branch and that's why it's closed. Nothing is merged.

zhangzhiru commented 4 years ago

Then be sure to add comments before we close a PR.