hecmay commented 4 years ago

In this PR, we will introduce:

[x] Enhance the subgraph extraction algorithm for op scheduling: the current implantation works under many assumptions. The algorithm will check the attaching information between child stage and parent stage, and reconstruct the top-level function. This should fix #230 and #219. Test cases available here.

[x] Enhance support for streaming channel arrays. We can create streaming channels and pass values between them using a .to primitive. Here is an example for creating channels between Systolic Array PEs. The .to primitive creates self-loopback streaming channels, and pass values between PEs:


def kernel(A, B, O):
    localA = hcl.compute((m, k-1), lambda *args: 0, "localA")
    localB = hcl.compute((k-1, n), lambda *args: 0, "localB")
    def update(k, y, x):
        last = hcl.scalar(hcl.select(k==0, 0, O[y, x]), "last")
        localA[y, x] = hcl.select(x>0, localA[y, x-1], A[y, k])
        localB[y, x] = hcl.select(y>0, localB[y-1, x], B[k, x])
        O[y, x] = last.v + localA[y, x] * localB[y, x]
    hcl.mutate((m, dim_y, dim_x), 
        lambda k, y, x: update(k, y, x), name="update")

s[k].pipeline(k.axis[0])

self loopback streaming

s.to(k.localA, kernel.update) s.to(k.localB, kernel.update)

- [x] [**XRT Streaming between host and device**](https://github.com/Hecmay/heterocl/blob/stream_to/tests/test_schedule_stream.py#L443): updated the XOCL host code generator to generate streaming channel between host and device. Example OpenCL host program:
```c++
  cl::Kernel kernel(program, "test", &err);
  cl_platform_id platform_id = device.getInfo<CL_DEVICE_PLATFORM>(&err);
  xcl::Stream::init(platform_id);

  cl_mem_ext_ptr_t ext;
  ext.param = kernel.get();
  ext.obj = NULL;

  ext.flags = 0;
  cl_stream StreamExt_A = xcl::Stream::createStream(device.get(), CL_STREAM_READ_ONLY, CL_STREAM, &ext, &err);

[ ] Add compute unit duplication function (TODO): We can extract the kernel function, auto-merging all ops inside the function body, and perform loop splitting on the outermost loop. Note that this is not the full-fledged version. Still need to update the host code generator to make it work.
```
A_, B_ = s.to([A, B], target.xcel)
ret_ = s.to(kernel.ret, target.host)
# extract the kernel function based on data movement information 
kernel = s.subgraph(inputs=[A_, B_], outputs=[ret_])
# split the kernel into two compute units
kernel.duplicate(factor=2)
```

[x] Add performance report function for HLS function. This feature is enhanced by Hongzhong in #215. An example output from the HLS report.

[21:33:10] Generating harness files ...
[21:33:10] Compiling the program ...
[01:33:33] Resource consumption
[--------]    Clock Period :    5.806
[--------]    Best Latency :     2303
[--------] Average Latency :     2303
[--------]   Worst Latency :     2303
[--------]            BRAM :        2
[--------]              FF :      154
[--------]             LUT :      425
[--------]          DSP48E :        0
[21:33:33] Execution complete

[x] Add shimds updating function to reuse pre-generated bitstream: The shared memory IDs will expire after program exits. However, the generated bitstream files can be reused if the generated kernel code is unchanged. This function checks the HASH value of generated kernel code. If the HASH stays the same and bitstream exists, then HeteroCL will only update the shmids in host program and reuse the pre-generated bitstream.
```
[15:07:57] Found existing bitstream in work directory. Checking HASH ... 
[15:07:57] HASH matched. Updating host program shmids ... 
[15:07:57] Compiling the host program...
```

[x] Fixed VHLS tensor representation issue: restore to multi-dimension tensor format, instead of flattening the indices all the time. The Vivado HLSC CodeGen will generate load and store operations with multiple dimensions. This should fix #233. Example as followed:

void default_function(ap_int<32> A[10][10], ap_int<32> B[2][5][10]) {
for (ap_int<32> y_outer = 0; y_outer < 2; ++y_outer) {
for (ap_int<32> y_inner = 0; y_inner < 5; ++y_inner) {
  for (ap_int<32> x = 0; x < 10; ++x) {
    B[y_outer][y_inner][x] = A[(y_inner + (y_outer * 5))][x];
  }
}
}
}

zhangzhiru commented 4 years ago

Multi-device support (to allow users to split the workload into multiple FPGA boards).

This is cool. But what's the user case and how to test such feature?

Enhance the graph partitioning algorithm for op scheduling. For example, the algorithm assumes that the users specifies all the data movement to form a subgraph in the CFG, which is not always true.

The graph partitioning is used for support .to()? Can you open a separate issue to list the potential bugs due to the assumptions we are making?

hecmay commented 4 years ago

The user case would be, for example, to split a large workload or to duplication compute units for data parallelism. For now we only test the correctness of generated IR, the generated program can be tested on AWS F1 instance with multiple FPGA cards.
To be strict, I should call it subgraph extraction algorithm. it's not the classic graph partitioning algorithm (where the whole graph is partitioned into balanced subgraphs and communication is minimized). Sure I can create an issue.

hecmay commented 4 years ago

Added a simple resource reporting function. Example output:

[21:33:10] Generating harness files ...
[21:33:10] Compiling the program ...
[01:33:33] Resource consumption
[--------]    Clock Period :    5.806
[--------]    Best Latency :     2303
[--------] Average Latency :     2303
[--------]   Worst Latency :     2303
[--------]            BRAM :        2
[--------]              FF :      154
[--------]             LUT :      425
[--------]          DSP48E :        0
[21:33:33] Execution complete

hecmay commented 4 years ago

[x] XRT Streaming between host and device
[ ] Add double buffer support between host and device
[ ] Add stencil support (connecting SODA with HLSC CodeGen flow): we extract the optimized pattern and insert it into HeteroCL as an external module
[ ] Peer-to-peer streaming (between SSD and FPGA on AWS for example).
[x] Fix the IP Integration error (since the calling function may create tensor with side effects outside IP function, we need to create empty tensor beforehand. Host-created device-used tensor will be automatically handled by IR pass)
[x] Fix the subgraph extraction issue: the subgraph should not contain host-declared extern ops (disabled now, the scheduled op order can be misaligned).
[x] Add auto-merging for all ops in the python subgraph API
[x] Add performance report function for HLS function
[x] Add test cases for .to combined with other primitives; Fix #219
[ ] Replace the multi-process shared memory with Boost python to connect python testbench and C++/OpenCL host (the linux shmop is not that stable and may cause SegFault)

hecmay commented 4 years ago

An example for kernel duplication: @Huyuwei

    def test_merge_kernel_stages():
        hcl.init()
        A = hcl.placeholder((10, 32), "A")
        B = hcl.placeholder((10, 32), "B")

        def kernel(A, B):
            C = hcl.compute(A.shape, lambda i, j: 0, "C")
            hcl.update(C, lambda i, j: A[i,j] + 1, "s1")
            hcl.update(C, lambda i, j: B[i,j] * 2, "s2")
            return hcl.compute(C.shape, lambda *args: C[args] + 3, "ret")

        target = hcl.platform.aws_f1
        s = hcl.create_schedule([A, B], kernel)

        A_, B_ = s.to([A, B], target.xcel)
        ret_ = s.to(kernel.ret, target.host)

        kernel = s.subgraph(inputs=[A_, B_], outputs=[ret_])
        kernel.duplicate(factor=2)

seanlatias commented 4 years ago

Can you fix the conflicts first? Also, for each of the new features, is there a corresponding test case? If so, can you point that out so it'd be easier to look? For example, after each feature, write the test case name.

seanlatias commented 4 years ago

Oh, this is not ready to review? I looked at the wrong PR.

hecmay commented 4 years ago

Features to be added in the next PR:

[ ] Double buffer mode for host-device communication (on-chip double buffer is automatically enabled by dataflow optimization by HLS tools)
[ ] Multi-device support (to allow users to split the workload into multiple FPGA boards). We will generate separate kernel functions and corresponding host program (e.g. with different command queues and kernel functions).
```
     p = hcl.platform.custom(config)
     s = hcl.create_schedule([A, B], kernel)
     s.to(A, p.xcel[1])
     s.to(B, p.xcel[2])
```
[ ] Add stencil support (connecting SODA with HLSC CodeGen flow): we extract the optimized pattern and insert it into HeteroCL as an external module
[ ] Peer-to-peer streaming (between SSD and FPGA on AWS for example).
[ ] LLVM JIT: update the concurrent stages scheduling function.
[ ] Add an example of of streaming between two consecutive 2D CONV layers. The first CONV layer is split in X/Y axis. Here we create separate streaming channels for each split segment. All MAC operations are pipelined in each segment, and all the segments are supposed to execute in parallel.

// First CONV Layer 
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    pipelined (xi, 0, K) {
      for (yi, 0, L ) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) { 
            local += Input[...] * kernel[...]
          }
        }
        O[...] = local
        channel[yo, xo].write(local)
      }
    }
  }
}

// Next CONV Layer
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    // shift register as reuse line buffer 
    buffer[yo, xo].insert(channel[yo, xo].read())
    pipelined (xi, 0, K) {
      for (yi, 0, L) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) {
            local += buffer[...] * kernel[...]
          }
        }
        O[...] = local
      }
    }
  }
}

hecmay commented 4 years ago

@seanlatias Update the description. For some of the minor changes (e.g. HLS reporting function, multi-dimension tensor), I did not put the link to test cases here.

seanlatias commented 4 years ago

The second one is still unclear to me. In your example, both yo and xo should be paralleled, right? For the second conv layer, are yo and xo also paralleled? The shift register is also inferred automatically? When will it be shifted? I suppose what you are doing is just simply performing tiling, which results in multiple PEs for the first conv layer? How about the second conv? Do we have a single PE or multiple PEs?

hecmay commented 4 years ago

The second one is still unclear to me. In your example, both yo and xo should be paralleled, right? For the second conv layer, are yo and xo also paralleled? The shift register is also inferred automatically? When will it be shifted? I suppose what you are doing is just simply performing tiling, which results in multiple PEs for the first conv layer? How about the second conv? Do we have a single PE or multiple PEs?

Yes, you are right. yo and xo should be running in parallel for both two layers. I suppose this can be realized by inserting dataflow pragma on top of these two layers and unrolling the xo/yo axes?

The second layer should also be tiled into multiple PEs. I plan to use the reuse_at primitive to generate the shift registers. This is not fully realized yet. I only have a simple example to showcase the streaming arrays.

hecmay commented 4 years ago

Most changes have been addressed. I think we need to consider Copy and ZeroCopy case separately. I can put it into another PR.

seanlatias commented 4 years ago

@Hecmay, can you also add a test for #240? Thanks.

hecmay commented 4 years ago

Sure. Here is the test case: https://github.com/Hecmay/heterocl/blob/stream_to/tests/test_schedule_stream.py#L531

The allocate statement for partitioned stage is not optimized away. But it does not matter that much. I suppose there might be something wrong the lift_alloc_attr pass.

cornell-zhang / heterocl

[API][Utils] Streaming Support Enhancement v2 #206

self loopback streaming