[API] Data placement and streaming API rewrite

hecmay commented 4 years ago

[x] Fix the graph hierarchy issue (by reconstructing a call graph from the flattened graph)
[x] Moved the code transformation logic to the end of the lowering process. Should fix #160 and fix #262 and fix #207
[x] Removed the automatically created on-chip buffer when moving data to/from the device. Users can choose between DMA and Stream mode for host-device communication (DMA is enabled by default). Should fix #235.
[x] Constraint checking (e.g. one-read-one-write)
[x] Fixed the conflicts with reuse_at (by adding an IR pass to fix the buffer binding issues). Should fix #264, fix #230 and fix #219 and fix #154
[x] Modified the printing logic of CodeGen VHLS. The function arguments are defined with the explicitly specified range, instead of using a pointer. Should fix #194
[x] Tutorial examples of default and custom platforms
[x] Tutorial examples of .to for on-chip and host-to-device streaming
[x] Support s.subgraph(). This is needed for dataflow support #245
[x] Fix the bugs in test_runtime_build and test_schedule_stream
[x] Added auto data placement logic when lowering the program
[x] Use hls::stream to implement FIFOs in the Vivado HLS backend (note: based on what we observed in Vivado HLS 2019.2, this is less stable than a pragma-annotated multi-dimensional array). Should fix #286

Other new features

[x] Added HCL_DEBUG_LEVEL(level) MACRO (enabled when HCL_DEBUG_ON is set)
[x] Add burst mode to host-to-device data streaming. Should fix #277
[x] Added Systolic Array with streaming between PE modules (modified from Niansong's code). Should fix #267
[x] Support on-chip data movement (e.g. s.to(tensor, p.xcel.BRAM)). Should fix #281
[x] Added module mode for hcl.compute. Should fix #283.
[x] Copied loop labeling from #259 authored by Hongzheng (not sure why the previous build didn't pass...)
[x] Added a JSON library to replace the shared memory in the HCL runtime. Should fix #290 and fix #241. During testing, I got the following error from Python: "Error in python: double free or corruption (!prev)"
[x] Added automatic stage re-naming for duplicate names. Should fix https://github.com/cornell-zhang/heterocl-docs/issues/44
[x] Turned off the auto-function inlining in VHLS backend. Should help fix this issue: https://github.com/cornell-zhang/heterocl-docs/issues/40
[x] [2020-10] Updated hlib BNN and NN libraries (copied from Hongzheng's repo)
[x] [2020-10] Added ReActNet example (copied from Hongzheng's repo)
[x] [2020-10] Added const array support (merged from Sean's PR)
[x] [2020-10] Supported fixed-point to ap int dtype conversion in Intel AOCL
[x] [2020-10] Supported set bit slicing op in Intel AOCL backend
[x] [2020-10] Added Insider backend (not well tested yet; should solve #292)
[x] [2020-10] Support fluent programming style for .to (e.g. s.to(A, dest1).to(dest2) to specify data flow)
[x] [2020-10] Support explicit unrolling (should fix #308)
[x] [2020-11] Support querying input tensors of a TVM stage
[ ] [2020-11] Support static variable inside HCL module
[x] [2020-12] Added Auto-SA integration interface
[x] [2020-12] Supported weight stationary SA generation with .to scheduling.
[x] [2020-12] Integration with SODA using extern module (using .to)
[ ] [2020-12] Added the dataflow primitive for both loop and function level
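
The fluent .to style mentioned in the list above can be pictured with a small stand-alone model. This is only a sketch of the chaining pattern, not HeteroCL's implementation: ToySchedule, Hop, and the string destinations are hypothetical names made up for illustration, and the assumption that a tensor starts on the host is mine.

```python
class Hop:
    """Handle returned by a move; calling .to() on it appends the next hop."""

    def __init__(self, schedule, tensor, dest):
        self.schedule = schedule
        self.tensor = tensor
        self.dest = dest

    def to(self, next_dest):
        # The data now sits at self.dest, so the next move starts from there.
        return self.schedule._move(self.tensor, self.dest, next_dest)


class ToySchedule:
    """Hypothetical stand-in for a schedule; it only records data movements."""

    def __init__(self):
        self.moves = []  # list of (tensor, source, destination)

    def _move(self, tensor, src, dest):
        self.moves.append((tensor, src, dest))
        return Hop(self, tensor, dest)

    def to(self, tensor, dest):
        # Assumption for this sketch: tensors start out on the host.
        return self._move(tensor, "host", dest)


s = ToySchedule()
s.to("A", "dest1").to("dest2")
print(s.moves)  # [('A', 'host', 'dest1'), ('A', 'dest1', 'dest2')]
```

Each call returns a handle that remembers where the data currently lives, which is what lets a chain read left to right as a data flow.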

Fixed bugs

[x] Fix #260 Cannot quantize data in Vitis flow
[x] Added checking logic for the bitwidth of top-function input args (must be a multiple of 8 for the Vitis backend). Should fix #261
[x] Added checking logic for inter-stage streaming channels. Should fix #270
[x] Added naming to array partition primitive (prevent stages with duplicate names). Should fix #273
[x] Added stage name-checking logic to prevent naming duplications. Should fix #218, fix #250 and fix #197
[x] Added fix to handle the disconnected graph. Should fix #274, fix #240 and fix #271
[x] Added condition check on FIFO consumers loading. Should fix #284
[x] Fixed issue of loop labelling (convert illegal label names like loop.1 to loop_1)
[x] [2020-10] Cast fixed-point data type to floating in AOCL backend. Should fix #298
[x] [2020-10] Added analysis function to analyze input IP files, and insert the non-inlined function call. Should fix #297
[x] [2020-10] Added a backup hlib.nn library for BNN designs (with custom data types)
[x] [2020-10] Fixed OpenCL code generation issues (use 1D flattened index by default)
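
Two of the checks in the list above (loop-label conversion and the multiple-of-8 bitwidth rule) are simple enough to sketch in plain Python. The function names sanitize_label and check_port_bitwidth are hypothetical, and the real passes in the HCL code base may work differently; this only illustrates the intended behavior.

```python
import re


def sanitize_label(name):
    """Replace characters that are illegal in a C label with underscores."""
    label = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # A C identifier cannot start with a digit, so prefix an underscore.
    if label and label[0].isdigit():
        label = "_" + label
    return label


def check_port_bitwidth(arg_name, bitwidth):
    """Raise ValueError if a top-function argument is not byte-aligned."""
    if bitwidth <= 0 or bitwidth % 8 != 0:
        raise ValueError(
            f"argument {arg_name!r}: bitwidth {bitwidth} is not a multiple of 8"
        )


print(sanitize_label("loop.1"))  # loop_1
check_port_bitwidth("A", 32)     # passes silently
```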

hecmay commented 4 years ago

@Hecmay Did you add back the interface for Vivado HLS? #261 is not only a Vitis problem, but also works for Vivado HLS.

You mean Vivado HLS also requires the port width to be a multiple of 8? No. Vivado HLS does not require the interface pragmas. They are only required in Vitis 2019.2.

I mean our codegen will generate #pragma HLS INTERFACE for Vivado HLS now, which requires the bitwidth of the input argument to be a multiple of 8.

Removed now. Now the interface pragmas are only intended for Vitis flow.

hecmay commented 4 years ago

It is interesting. I just copied @chhzh123 your changes to this PR, and got the same error from Keras, even though we did not touch Keras at all...

chhzh123 commented 4 years ago

It is interesting. I just copied @chhzh123 your changes to this PR, and got the same error from Keras, even though we did not touch Keras at all...

Maybe the Keras test can be commented out. I think it is not our major focus at this time.

chhzh123 commented 4 years ago

The following stages, which declare buffers first and then update them using imperative code, cannot be streamed correctly.

import heterocl as hcl
import numpy as np

def test_zero():
    A = hcl.placeholder((10,), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i: A[i] + 1, "B")
        C1 = hcl.compute(A.shape, lambda i: 0, "C1")
        C2 = hcl.compute(A.shape, lambda i: 0, "C2")
        def foo(i):
            C1[i] = B[i] + 1
            C2[i] = C1[i] + 1
        hcl.mutate((10,), lambda i: foo(i), "C")
        D = hcl.compute(A.shape, lambda i: C2[i] + 1, "D")
        return D

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csim")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.D, target.host)
    s.to(kernel.B, s[kernel.C1])
    s.to(kernel.C2, s[kernel.D])
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_D = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_D = hcl.asarray(np_D)
    f(hcl_A, hcl_D)

The generated code does not capture the C1 and C2 arrays in the foo function, but captures the zero-initialized placeholders instead.

void test(bit32 A[10], bit32 D[10]) {
    bit32 _top;
    bit32 B[10];
    bit32 B_pipe_1[10];
    #pragma HLS dataflow
    #pragma HLS stream variable=B_pipe_1 depth=1
    B_i: for (bit32 i = 0; i < 10; ++i) {
      bit32 B_temp;
      B_temp = (A[i] + 1);
      B_pipe_1[i] = B_temp;
    }
    bit32 C1[10];
    bit32 C2[10];
    bit32 C2_pipe_2[10];
    #pragma HLS stream variable=C2_pipe_2 depth=1
    C2_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 C2_temp;
      C2_temp = 0;
      C2_pipe_2[i1] = C2_temp;
      C2[i1] = C2_temp;
    }
    bit32 C;
    C_i2: for (bit32 i2 = 0; i2 < 10; ++i2) {
      C1[i2] = (B[i2] + 1);
      C2[i2] = (C1[i2] + 1);
    }
    D_i3: for (bit32 i3 = 0; i3 < 10; ++i3) {
      bit32 C2_temp1;
      C2_temp1 = C2_pipe_2[i3];
      D[i3] = (C2_temp1 + 1);
    }
  }

hecmay commented 4 years ago

The generated code does not capture the C1 and C2 arrays in the foo function

I do not quite understand what you mean here. The C1 and C2 arrays are both initialized with all zero values, so in the generated code, we just create two allocate statements for them.

The FIFO is not created correctly. You should consider using the following style:

    # Stage C is the actual consumer of tensor B 
    s.to(kernel.B, s[kernel.C])
    # You want to stream the C2 (the C2 modified after C stage) to the D stage
    s.to(kernel.C.C2, s[kernel.D])

I just ran it on the server. And it should be able to generate the code we expected. Please see the test case: https://github.com/Hecmay/heterocl/blob/fix/tests/issues/test_issue_284.py#L29

chhzh123 commented 4 years ago

The FIFO is not created correctly. You should consider using the following style

    # Stage C is the actual consumer of tensor B 
    s.to(kernel.B, s[kernel.C])
    # You want to stream the C2 (modified after C stage to the D stage)
    s.to(kernel.C.C2, s[kernel.D])

Yes, it works. Thanks!

hecmay commented 4 years ago

I added RapidJSON, a header-only C++ library for JSON, to the HCL codebase. Since the library is not too big, can we just include the source code in our repo? If that is not preferable, we can git clone it. @seanlatias @zhangzhiru

hecmay commented 4 years ago

Error message I got on CI:

=================================== FAILURES ===================================
_______________________________ test_tutorial_08 _______________________________

    def test_tutorial_08():

>       import tutorial_08_backend

tutorials/test_tutorial.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tutorials/tutorial_08_backend.py:38: in <module>
    f(hcl_A, hcl_B)
../.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/heterocl/tvm/_ffi/function.py:128: in __call__
    return f(*args)
../.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/heterocl/tvm/_ffi/_ctypes/function.py:183: in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def check_call(ret):
        """Check the return value of C API call

        This function will raise exception when error occurs.
        Wrap every API call with this function

        Parameters
        ----------
        ret : int
            return value from API calls
        """
        if ret != 0:
>           raise TVMError(py_str(_LIB.TVMGetLastError()))
E           heterocl.tvm._ffi.base.TVMError: [16:40:41] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) Assert fail: ((((tvm_struct_get(arg1, 0, 5) == (uint8)0) && (tvm_struct_get(arg1, 0, 6) == (uint8)32)) && (tvm_struct_get(arg1, 0, 8) == (uint8)0)) && (tvm_struct_get(arg1, 0, 7) == (uint8)1)), arg1.dtype is expected to be int32
E           
E           Stack trace returned 10 entries:
E           [bt] (0) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(dmlc::StackTrace[abi:cxx11]()+0x40) [0x7ff263dbefa0]
E           [bt] (1) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x29) [0x7ff263dbf689]
E           [bt] (2) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(TVM::codegen::LLVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<TVM::runtime::ModuleNode> const&)::{lambda(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*)#2}::operator()(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*) const+0x18c) [0x7ff2640b718c]
E           [bt] (3) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(std::_Function_handler<void (TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*), TVM::codegen::LLVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<TVM::runtime::ModuleNode> const&)::{lambda(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*)#2}>::_M_invoke(std::_Any_data const&, TVM::runtime::TVMArgs&&, TVM::runtime::TVMRetValue*&&)+0x17) [0x7ff2640b7267]
E           [bt] (4) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(TVMFuncCall+0x4c) [0x7ff26420344c]

However, I was not able to reproduce this error on our local servers...

chhzh123 commented 4 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

hecmay commented 4 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

Yeah. That's a good suggestion. I can change that. Thanks!

zhangzhiru commented 3 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

What's an unoptimized dataflow program? One that is not actually pipelined? This conservative solution is fine for small tensors. But in general, the area overhead is huge.

chhzh123 commented 3 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

What's an unoptimized dataflow program? One that is not actually pipelined? This conservative solution is fine for small tensors. But in general, the area overhead is huge.

A program that has not been optimized for area. I mean that if users do not specify the FIFO depth, our default depth should be large enough to be safe. Users who know what they are doing can set a smaller depth to save area.
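
The trade-off discussed above can be illustrated with a toy cycle-level model. This is a sketch under simplifying assumptions that are mine, not taken from the HLS tools: the producer writes one element per cycle, the consumer reads one element per cycle once it has started, and required_fifo_depth is a hypothetical helper name. It reports the peak number of elements the FIFO must hold.

```python
def required_fifo_depth(num_elements, consumer_start_delay):
    """Peak FIFO occupancy for a one-element-per-cycle producer and consumer.

    consumer_start_delay is the first cycle in which the consumer may read.
    """
    produced = consumed = occupancy = peak = cycle = 0
    while consumed < num_elements:
        if produced < num_elements:
            produced += 1
            occupancy += 1
        peak = max(peak, occupancy)
        if cycle >= consumer_start_delay and occupancy > 0:
            consumed += 1
            occupancy -= 1
        cycle += 1
    return peak


# Consumer starts only after the producer has finished: depth equals array size.
print(required_fifo_depth(10, 10))  # 10
# Fully overlapped producer and consumer: a depth of 1 is enough.
print(required_fifo_depth(10, 0))   # 1
```

In this model a depth equal to the array size is always safe, which matches the conservative default, while an overlapped design needs far less storage, which is where the area concern comes from.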

hecmay commented 3 years ago

I tried to run the KMeans (optimized) design on AWS. It turned out that the design synthesized by Vitis HLS performs slightly worse than the one synthesized by Vivado HLS (the default HLS tool in Vitis 2019.2). I think I need to fall back to Vitis 2019 on AWS and try again.

seanlatias commented 3 years ago

@Hecmay #259 is merged.

hecmay commented 3 years ago

@whbldhwj Jie, please see the latest example here: https://github.com/Hecmay/heterocl/blob/fix/tests/test_schedule_systolic.py#L6-L31

The data type has been updated: we no longer have the nested casts for the AutoSA module, and we now have header and main-body sections for the imported IP.

hecmay commented 3 years ago

Moved to #316

cornell-zhang / heterocl

[API] Data placement and streaming API rewrite #265