cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[API] Data placement and streaming API rewrite #265

Closed hecmay closed 3 years ago

hecmay commented 4 years ago

Other new features

Fixed bugs

hecmay commented 4 years ago

@Hecmay Did you add back the interface for Vivado HLS? #261 is not only a Vitis problem; it also applies to Vivado HLS.

You mean Vivado HLS also requires the port width to be a multiple of 8? No. Vivado HLS does not require the interface pragmas. They are only required in Vitis 2019.2.

I mean our codegen will generate #pragma HLS INTERFACE for Vivado HLS now, which requires the bitwidth of the input argument to be a multiple of 8.

Removed now. The interface pragmas are now only intended for the Vitis flow.
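The multiple-of-8 constraint mentioned above can be sketched in a few lines of Python. This is only an illustration: `is_port_width_ok` and `round_up_to_byte_multiple` are hypothetical helper names and are not part of HeteroCL or its codegen.

```python
def is_port_width_ok(bitwidth):
    """Return True if the bitwidth is a positive multiple of 8.

    Hypothetical check mirroring the constraint described in this thread.
    """
    return bitwidth > 0 and bitwidth % 8 == 0


def round_up_to_byte_multiple(bitwidth):
    """Round a positive bitwidth up to the next multiple of 8 (hypothetical helper)."""
    if bitwidth < 1:
        raise ValueError("bitwidth must be positive")
    return (bitwidth + 7) // 8 * 8


for bits in (1, 8, 12, 32):
    print(bits, is_port_width_ok(bits), round_up_to_byte_multiple(bits))
```

With a helper like this, a 12-bit argument would be padded to 16 bits before an interface pragma is emitted, while 8-bit and 32-bit arguments pass through unchanged.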

hecmay commented 4 years ago

It is interesting. I just copied @chhzh123 your changes to this PR, and got the same error from Keras, even though we did not touch Keras at all...

chhzh123 commented 4 years ago

It is interesting. I just copied @chhzh123 your changes to this PR, and got the same error from Keras, even though we did not touch Keras at all...

Maybe the Keras test can be commented out. I think it is not our major focus at this time.

chhzh123 commented 4 years ago

The following stages, which declare buffers first and then update them with imperative code, cannot be streamed correctly.

import heterocl as hcl
import numpy as np

def test_zero():
    A = hcl.placeholder((10,), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i: A[i] + 1, "B")
        C1 = hcl.compute(A.shape, lambda i: 0, "C1")
        C2 = hcl.compute(A.shape, lambda i: 0, "C2")
        def foo(i):
            C1[i] = B[i] + 1
            C2[i] = C1[i] + 1
        hcl.mutate((10,), lambda i: foo(i), "C")
        D = hcl.compute(A.shape, lambda i: C2[i] + 1, "D")
        return D

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csim")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.D, target.host)
    s.to(kernel.B, s[kernel.C1])
    s.to(kernel.C2, s[kernel.D])
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_D = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_D = hcl.asarray(np_D)
    f(hcl_A, hcl_D)

The generated code does not capture the C1 and C2 arrays updated in the foo function; it captures the zero placeholders instead.

void test(bit32 A[10], bit32 D[10]) {
    bit32 _top;
    bit32 B[10];
    bit32 B_pipe_1[10];
    #pragma HLS dataflow
    #pragma HLS stream variable=B_pipe_1 depth=1
    B_i: for (bit32 i = 0; i < 10; ++i) {
      bit32 B_temp;
      B_temp = (A[i] + 1);
      B_pipe_1[i] = B_temp;
    }
    bit32 C1[10];
    bit32 C2[10];
    bit32 C2_pipe_2[10];
    #pragma HLS stream variable=C2_pipe_2 depth=1
    C2_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 C2_temp;
      C2_temp = 0;
      C2_pipe_2[i1] = C2_temp;
      C2[i1] = C2_temp;
    }
    bit32 C;
    C_i2: for (bit32 i2 = 0; i2 < 10; ++i2) {
      C1[i2] = (B[i2] + 1);
      C2[i2] = (C1[i2] + 1);
    }
    D_i3: for (bit32 i3 = 0; i3 < 10; ++i3) {
      bit32 C2_temp1;
      C2_temp1 = C2_pipe_2[i3];
      D[i3] = (C2_temp1 + 1);
    }
}

hecmay commented 4 years ago

The generated code does not capture the C1 and C2 arrays in the foo function

I do not quite understand what you mean here. The C1 and C2 arrays are both initialized with all zero values, so in the generated code, we just create two allocate statements for them.

The FIFO is not created correctly. You should consider using the following style

    # Stage C is the actual consumer of tensor B 
    s.to(kernel.B, s[kernel.C])
    # You want to stream the C2 (the C2 modified after C stage) to the D stage
    s.to(kernel.C.C2, s[kernel.D])

I just ran it on the server, and it should generate the code we expect. Please see the test case: https://github.com/Hecmay/heterocl/blob/fix/tests/issues/test_issue_284.py#L29
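To make the mismatch in the original schedule concrete, the generated C code quoted earlier can be mirrored line by line in plain Python and compared with the semantics of the Python kernel. This is a sketch for illustration only: it models the FIFOs as ordinary lists, and it assumes the never-written B array holds zeros (in the C code it is uninitialized).

```python
N = 10


def reference(a):
    """Semantics requested by the Python kernel in test_zero."""
    b = [x + 1 for x in a]
    c1 = [0] * N
    c2 = [0] * N
    for i in range(N):
        c1[i] = b[i] + 1
        c2[i] = c1[i] + 1
    return [x + 1 for x in c2]


def generated(a):
    """Line-by-line mirror of the generated C code for the original schedule."""
    b = [0] * N                      # assumption: zeros; uninitialized in the C code
    b_pipe_1 = [0] * N
    for i in range(N):
        b_pipe_1[i] = a[i] + 1       # stage B writes only to the pipe
    c1 = [0] * N
    c2 = [0] * N
    c2_pipe_2 = [0] * N
    for i in range(N):
        c2_pipe_2[i] = 0             # the pipe is filled by the zero-initialization stage
        c2[i] = 0
    for i in range(N):
        c1[i] = b[i] + 1             # stage C reads B, not B_pipe_1
        c2[i] = c1[i] + 1            # stage C updates C2, not C2_pipe_2
    return [c2_pipe_2[i] + 1 for i in range(N)]  # stage D reads the stale pipe


a = [0] * N
print(reference(a))  # every element is 4
print(generated(a))  # every element is 1
```

Stage D consumes the zeros pushed by the C2 initialization stage rather than the values written by stage C, which is why the producer has to be named as kernel.C.C2 and the consumer of B as s[kernel.C].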

chhzh123 commented 4 years ago

The FIFO is not created correctly. You should consider using the following style

    # Stage C is the actual consumer of tensor B 
    s.to(kernel.B, s[kernel.C])
    # You want to stream the C2 (modified after C stage to the D stage)
    s.to(kernel.C.C2, s[kernel.D])

Yes, it works. Thanks!

hecmay commented 4 years ago

I added RapidJSON, a header-only C++ library for JSON, to the HCL codebase. Since the library is not too big, can we just include the source code in our repo? If that is not preferable, we can git clone it. @seanlatias @zhangzhiru

hecmay commented 4 years ago

Error message I got on CI:

=================================== FAILURES ===================================
_______________________________ test_tutorial_08 _______________________________

    def test_tutorial_08():

>       import tutorial_08_backend

tutorials/test_tutorial.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tutorials/tutorial_08_backend.py:38: in <module>
    f(hcl_A, hcl_B)
../.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/heterocl/tvm/_ffi/function.py:128: in __call__
    return f(*args)
../.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/heterocl/tvm/_ffi/_ctypes/function.py:183: in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def check_call(ret):
        """Check the return value of C API call

        This function will raise exception when error occurs.
        Wrap every API call with this function

        Parameters
        ----------
        ret : int
            return value from API calls
        """
        if ret != 0:
>           raise TVMError(py_str(_LIB.TVMGetLastError()))
E           heterocl.tvm._ffi.base.TVMError: [16:40:41] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) Assert fail: ((((tvm_struct_get(arg1, 0, 5) == (uint8)0) && (tvm_struct_get(arg1, 0, 6) == (uint8)32)) && (tvm_struct_get(arg1, 0, 8) == (uint8)0)) && (tvm_struct_get(arg1, 0, 7) == (uint8)1)), arg1.dtype is expected to be int32
E           
E           Stack trace returned 10 entries:
E           [bt] (0) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(dmlc::StackTrace[abi:cxx11]()+0x40) [0x7ff263dbefa0]
E           [bt] (1) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x29) [0x7ff263dbf689]
E           [bt] (2) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(TVM::codegen::LLVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<TVM::runtime::ModuleNode> const&)::{lambda(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*)#2}::operator()(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*) const+0x18c) [0x7ff2640b718c]
E           [bt] (3) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(std::_Function_handler<void (TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*), TVM::codegen::LLVMModuleNode::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<TVM::runtime::ModuleNode> const&)::{lambda(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*)#2}>::_M_invoke(std::_Any_data const&, TVM::runtime::TVMArgs&&, TVM::runtime::TVMRetValue*&&)+0x17) [0x7ff2640b7267]
E           [bt] (4) /home/circleci/.local/lib/python3.6/site-packages/heterocl-0.1-py3.6.egg/lib/libhcl.so(TVMFuncCall+0x4c) [0x7ff26420344c]

However, I was not able to reproduce this error on our local servers...

chhzh123 commented 4 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

hecmay commented 4 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

Yeah. That's a good suggestion. I can change that. Thanks!

zhangzhiru commented 3 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

What's an unoptimized dataflow program? One that is not actually pipelined? This conservative solution is fine for small tensors. But in general, the area overhead is huge.

chhzh123 commented 3 years ago

I think the default FIFO depth should be set as the size of the array, which can guarantee the correctness of an unoptimized dataflow program.

What's an unoptimized dataflow program? One that is not actually pipelined? This conservative solution is fine for small tensors. But in general, the area overhead is huge.

A program that has not been optimized for area. I mean that if users do not specify the FIFO depth, our default depth should be large enough to be safe. Users who know what they are doing can set a smaller depth to consume less area.
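The trade-off discussed here can be put into numbers with a small Python sketch. The helpers `fifo_depth` and `fifo_bits` are hypothetical and not part of the HeteroCL API. They assume the default depth equals the number of elements in the tensor, and that storage grows as depth times element bitwidth.

```python
from math import prod


def fifo_depth(shape, depth=None):
    """Return the FIFO depth for a streamed tensor (hypothetical helper).

    If the user gives no depth, fall back to the conservative default
    proposed in this thread: the number of elements in the tensor.
    """
    if depth is None:
        return prod(shape)
    if depth < 1:
        raise ValueError("FIFO depth must be at least 1")
    return depth


def fifo_bits(shape, bitwidth, depth=None):
    """Estimate FIFO storage in bits as depth * element bitwidth."""
    return fifo_depth(shape, depth) * bitwidth


print(fifo_bits((10,), 32))                   # 320
print(fifo_bits((1024, 1024), 32))            # 33554432
print(fifo_bits((1024, 1024), 32, depth=16))  # 512
```

For the 10-element tensors in the test above the default costs only 320 bits, but for a 1024 x 1024 tensor of 32-bit values it grows to 32 Mbit, which is the area overhead raised in the question.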

hecmay commented 3 years ago

I tried running the optimized KMeans design on AWS. It turned out that the design synthesized by Vitis HLS performs slightly worse than the one synthesized by Vivado HLS (the default HLS tool in Vitis 2019.2). I think I need to fall back to Vitis 2019 on AWS and try again.

seanlatias commented 3 years ago

@Hecmay #259 is merged.

hecmay commented 3 years ago

@whbldhwj Jie, please see the latest example here: https://github.com/Hecmay/heterocl/blob/fix/tests/test_schedule_systolic.py#L6-L31

The data type has been updated: we no longer have the nested casts for the AutoSA module. We also have the header and main body sections for the imported IP.

hecmay commented 3 years ago

Moved to #316