[API] New Data Movement Mode w/o Local On-chip Buffer

hecmay commented 4 years ago

This PR will introduce zero-copy mode for data streaming. It should address the needs of #235.

Currently .to provides different modes for host-device data movement:

Streaming (FIFO): input data is streamed through a host-device pipe. Data from or to the pipe is used immediately instead of being stored in global memory or an on-chip buffer.
DMA: the host program migrates host memory to global memory, then creates an on-chip buffer on the FPGA to store the input data.
MMIO: the host program controls the HW by writing to or reading from a reserved memory address space. This is not supported yet.

Note that FIFO is the only supported mode for inter-kernel data movement. Aside from these three modes for host-device data movement, we also expose an option for users to decide whether to create an on-chip local buffer for storing data.

Users can choose to remove the on-chip buffer by adding local_buffer=False to the .to API. Namely, the host still migrates the memory object from system memory to global memory, but instead of creating an on-chip buffer, the kernel function directly accesses the DDR global memory.

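#include <hls_stream.h>

// with on-chip buffer
void test(hls::stream<int> &A_channel, hls::stream<int> &B_channel) {
  int A[10], B[10];
  // read the input stream into the on-chip buffer
  for (int i = 0; i < 10; ++i) {
    A[i] = A_channel.read();
  }
  // compute on the on-chip buffers
  for (int i = 0; i < 10; ++i) {
    B[i] = A[i] + 1;
  }
  // write the result back to the output stream
  for (int i = 0; i < 10; ++i) {
    B_channel.write(B[i]);
  }
}

// without on-chip buffer: the kernel reads and writes global memory directly
void test(int A[10], int B[10]) {
  for (int i = 0; i < 10; ++i) {
    B[i] = A[i] + 1;
  }
}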

A program in this mode may suffer from performance degradation because the kernel accesses global memory directly. However, the generated code will not contain the nested burst read/write loops, which makes analysis easier.
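
To make the trade-off concrete, here is a toy model in plain Python that counts global-memory reads for the two kernel shapes. It is only a sketch, not HeteroCL output: `GlobalMem`, `kernel_buffered`, `kernel_zero_copy`, and `run` are hypothetical names, and the workload `B[i] = A[i] + A[i + 2]` is an assumption chosen because each input element is reused.

```python
class GlobalMem:
    """Toy stand-in for DDR global memory that counts element reads."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]

    def __len__(self):
        return len(self.data)


def kernel_buffered(A):
    # Copy loop: each input element is fetched from global memory once.
    local = [A[i] for i in range(len(A))]
    # Compute loop: all reuse is served from the local (on-chip) copy.
    return [local[i] + local[i + 2] for i in range(len(local) - 2)]


def kernel_zero_copy(A):
    # Single loop: every operand is fetched from global memory directly.
    return [A[i] + A[i + 2] for i in range(len(A) - 2)]


def run(kernel, values):
    # Run a kernel on a fresh memory model and report its global read count.
    mem = GlobalMem(values)
    out = kernel(mem)
    return out, mem.reads


values = list(range(10))
buffered_out, buffered_reads = run(kernel_buffered, values)
zero_copy_out, zero_copy_reads = run(kernel_zero_copy, values)
```

Under these assumptions both kernels produce the same result, but the zero-copy variant issues 16 global reads against 10 for the buffered one. The actual cost on hardware depends on the memory interface and is not modeled here.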

chhzh123 commented 4 years ago

I think this mode still puts a great burden on programmers (and developers) and does not make synthesis easier. It may also introduce bugs easily. The different target modes (debug/csyn) and data movement options (FIFO/ZeroCopy) do NOT share the same codegen logic. Some combinations have been checked, but others cause errors without notice.

For example, when ZeroCopy is set, the function signature is again incorrect, which is exactly #194.

  void test(hls::stream<bit32 >& B, hls::stream<bit32 >& A) {
  #pragma HLS INTERFACE axis port=B offset=slave bundle=gmem0
  #pragma HLS INTERFACE axis port=A offset=slave bundle=gmem1
  #pragma HLS INTERFACE s_axilite port=return bundle=control
    for (bit32 y = 0; y < 8; ++y) {
      for (bit32 x = 0; x < 8; ++x) {
        B[y][x] = ((bit32)((x < 4) ? ((ap_int<33>)(((ap_int<33>)A[y][x]) + ((ap_int<33>)A[(y + 2)][(x + 2)]))) : ((ap_int<33>)0)));
      }
    }
  }

chhzh123 commented 4 years ago

Another test case failed.

import heterocl as hcl
import numpy as np

def test2():
    hcl.init()
    target = hcl.platform.zc706
    target.config(compile="vivado_hls",mode="csyn")
    A = hcl.placeholder((10, 10), "A")
    def kernel(A):
        return hcl.compute((8, 8), lambda y, x: A[y][x] + A[y+2][x+2], "B")
    s = hcl.create_schedule(A, kernel)
    s.partition(A)
    s[kernel.B].pipeline(kernel.B.axis[1])
    s.to(A, target.xcel, stream_type=hcl.Stream.ZeroCopy)
    s.to(kernel.B, target.host, stream_type=hcl.Stream.ZeroCopy)
    f = hcl.build(s, target=target)
    np_A = np.random.randint(0, 10, (10, 10))
    np_B = np.zeros((8, 8))
    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np_B)
    f(hcl_A, hcl_B)

Traceback (most recent call last):
  File "test.py", line 49, in <module>
    test2()
  File "test.py", line 39, in test2
    f = hcl.build(s, target=target)
  File "/mnt/f/heterocl-hecmay/python/heterocl/api.py", line 318, in build
    return _build(schedule.sch, new_inputs, target=target, name=name, stmt=stmt)
  File "/mnt/f/heterocl-hecmay/python/heterocl/tvm/build_module.py", line 554, in build
    return build_fpga_kernel(sch, args, target, name=name)
  File "/mnt/f/heterocl-hecmay/python/heterocl/tvm/build_module.py", line 502, in build_fpga_kernel
    return builder(fdevice, keys, vals)
  File "/mnt/f/heterocl-hecmay/python/heterocl/tvm/_ffi/function.py", line 280, in my_api_func
    return flocal(*args)
  File "/mnt/f/heterocl-hecmay/python/heterocl/tvm/_ffi/_ctypes/function.py", line 181, in __call__
    check_call(_LIB.TVMFuncCall(
  File "/mnt/f/heterocl-hecmay/python/heterocl/tvm/_ffi/base.py", line 66, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
heterocl.tvm._ffi.base.TVMError: [02:03:30] src/codegen/codegen_source_base.cc:85: Check failed: it != var_idmap_.end() Find undefined Variable A

hecmay commented 4 years ago

Thanks Hongzheng! This is still under development. I noticed that the function signature is not correct, and will fix it very soon.

zhangzhiru commented 4 years ago

Let's discuss this PR in our weekly zoom call. In general, we need to have a design driver for each new feature and we should add a lot more unit tests.

hecmay commented 4 years ago

The function signature issue has been fixed.

For the array partitioning example, the correct way to do that is as follows:

A_new = s.to(A, target.xcel, stream_type=hcl.Stream.ZeroCopy)
s.partition(A_new)

This should work, but there is still something going wrong with the op scheduling function. Will fix it soon.

zhangzhiru commented 4 years ago

What’s the meaning of zerocopy for a stream? This already sounds strange if not wrong in terms of terminology. A stream does push & pop. Where does zerocopy fit?

zhangzhiru commented 4 years ago

What other type do we have other than stream_type? Note that streaming should be one of the data movement types instead of the only one.

Do we actually mean “type=hcl.stream”?

hecmay commented 4 years ago

ZeroCopy is a mode of data movement, and streaming is another mode for data movement.

It's not a good idea to use hcl.Stream.ZeroCopy. Maybe we can use hcl.Move.ZeroCopy.

These are all modes we have now:

This PR will introduce zero-copy mode for data streaming. It should address the needs of #235.

Currently .to provides different modes for host-device data movement:

Streaming (FIFO): input data is streamed through a host-device pipe. Data from or to the pipe is used immediately instead of being stored in global memory or an on-chip buffer.

Copy: the host program migrates host memory to global memory, then creates an on-chip buffer on the FPGA to store the input data.

DoubleBuffer: the host program creates a double buffer to overlap kernel execution in a dataflow fashion. This is not implemented yet.

We need to create a ZeroCopy mode, where the host still migrates the memory object from system memory to global memory, but instead of creating an on-chip buffer, the kernel function directly accesses the DDR global memory. A program in this mode may suffer from performance degradation, but it will not have the burst read/write nested loops, which would make the analysis easier.
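
As a rough illustration of how the modes above could be grouped under a single data movement type (along the lines of the hcl.Move.ZeroCopy suggestion), here is a self-contained Python sketch. It is not part of HeteroCL: the `Move` enum, the two lookup tables, and the `staging` helper are hypothetical and only encode the descriptions given in this comment.

```python
from enum import Enum


class Move(Enum):
    """Hypothetical data movement modes, named after the list above."""
    Stream = "stream"
    Copy = "copy"
    DoubleBuffer = "double_buffer"
    ZeroCopy = "zero_copy"


# Whether data is staged in global memory, per the mode descriptions.
USES_GLOBAL_MEMORY = {Move.Stream: False, Move.Copy: True, Move.ZeroCopy: True}

# Whether an on-chip buffer is created, per the mode descriptions.
USES_LOCAL_BUFFER = {Move.Stream: False, Move.Copy: True, Move.ZeroCopy: False}


def staging(mode):
    """Return (uses_global_memory, uses_local_buffer) for a mode."""
    if mode is Move.DoubleBuffer:
        # Described as not implemented yet, so the sketch does not guess.
        raise NotImplementedError("DoubleBuffer is not implemented yet")
    return USES_GLOBAL_MEMORY[mode], USES_LOCAL_BUFFER[mode]
```

In this encoding, ZeroCopy differs from Copy only in skipping the on-chip buffer, and Stream uses neither form of storage, which is why treating ZeroCopy as a kind of stream reads oddly.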

hecmay commented 4 years ago

@Crystaldaidy Issue fixed. You should be able to run 3D rendering with the VHLS flow now.

hecmay commented 4 years ago

@XiangyiZhao @AlgaPeng The "cp ** Not Found" issue should be resolved in this PR. I only tested it locally, and it worked on my end. Before this PR is merged into master, you may pull it and check whether it works for you.

# clone the "fix" branch in my repo
git clone --single-branch --branch fix https://github.com/Hecmay/heterocl.git heterocl-test

# compile in a separate workdir
cd heterocl-test
make -j8

hecmay commented 4 years ago

@CrystalDaidy The bug has been fixed. Please refer to the steps I mentioned above to pull back and install this PR. The test program I used is available here: https://github.com/Hecmay/heterocl/blob/fix/samples/3d_rendering/3d_rendering_stage.py

hecmay commented 4 years ago

@AlgaPeng This issue is partially fixed in this PR. You should be able to run this design and generate real HW for the Sobel filter now. Please see the example here: https://github.com/Hecmay/heterocl/blob/fix/samples/sobel/sobel_vhls.py

The reuse_at primitive will cause SegFault (as described in #230), so I disabled those memory customizations to make it runnable.

cornell-zhang / heterocl

[API] New Data Movement Mode w/o Local On-chip Buffer #244