cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[API][Utils] Streaming Support Enhancement v2 #206

Closed · hecmay closed this 4 years ago

hecmay commented 4 years ago

In this PR, we will introduce:

- [x] **Self loopback streaming**: e.g.

```python
s[k].pipeline(k.axis[0])
s.to(k.localA, kernel.update)
s.to(k.localB, kernel.update)
```
- [x] [**XRT Streaming between host and device**](https://github.com/Hecmay/heterocl/blob/stream_to/tests/test_schedule_stream.py#L443): updated the XOCL host code generator to generate streaming channel between host and device. Example OpenCL host program:
```c++
  cl::Kernel kernel(program, "test", &err);
  cl_platform_id platform_id = device.getInfo<CL_DEVICE_PLATFORM>(&err);
  xcl::Stream::init(platform_id);

  cl_mem_ext_ptr_t ext;
  ext.param = kernel.get();
  ext.obj = NULL;

  ext.flags = 0;
  cl_stream StreamExt_A = xcl::Stream::createStream(device.get(), CL_STREAM_READ_ONLY, CL_STREAM, &ext, &err);
```
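(For reference, a minimal schedule-side sketch of what drives this host code generation, using the `.to()` form that appears later in this thread; the kernel here is just a stand-in:)

```python
import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 32), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")

target = hcl.platform.aws_f1
s = hcl.create_schedule([A], kernel)

# Moving tensors across the host/device boundary is what the updated
# XOCL host code generator turns into stream setup like the snippet above.
s.to(A, target.xcel)         # host -> device input
s.to(kernel.B, target.host)  # device -> host output

f = hcl.build(s, target)
```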
zhangzhiru commented 4 years ago

> Multi-device support (to allow users to split the workload into multiple FPGA boards).

This is cool. But what's the use case, and how do we test such a feature?

> Enhance the graph partitioning algorithm for op scheduling. For example, the algorithm assumes that the user specifies all the data movement to form a subgraph in the CFG, which is not always true.

Is the graph partitioning used to support `.to()`? Can you open a separate issue to list the potential bugs due to the assumptions we are making?

hecmay commented 4 years ago
  1. The use case would be, for example, to split a large workload across boards or to duplicate compute units for data parallelism. For now we only test the correctness of the generated IR; the generated program can be tested on an AWS F1 instance with multiple FPGA cards.

  2. Strictly speaking, I should call it a subgraph extraction algorithm. It's not the classic graph partitioning algorithm (where the whole graph is partitioned into balanced subgraphs and communication is minimized). Sure, I can create an issue.

hecmay commented 4 years ago

Added a simple resource reporting function. Example output:

```
[21:33:10] Generating harness files ...
[21:33:10] Compiling the program ...
[01:33:33] Resource consumption
[--------]    Clock Period :    5.806
[--------]    Best Latency :     2303
[--------] Average Latency :     2303
[--------]   Worst Latency :     2303
[--------]            BRAM :        2
[--------]              FF :      154
[--------]             LUT :      425
[--------]          DSP48E :        0
[21:33:33] Execution complete
```
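(For context: the clock period is in ns and the latencies in cycles; these numbers are presumably parsed from the Vivado HLS synthesis report generated during compilation, and printed when the built function is executed under the platform flow, as in the sketch earlier in this thread.)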
hecmay commented 4 years ago

An example of kernel duplication: @Huyuwei

    def test_merge_kernel_stages():
        hcl.init()
        A = hcl.placeholder((10, 32), "A")
        B = hcl.placeholder((10, 32), "B")

        def kernel(A, B):
            C = hcl.compute(A.shape, lambda i, j: 0, "C")
            hcl.update(C, lambda i, j: A[i,j] + 1, "s1")
            hcl.update(C, lambda i, j: B[i,j] * 2, "s2")
            return hcl.compute(C.shape, lambda *args: C[args] + 3, "ret")

        target = hcl.platform.aws_f1
        s = hcl.create_schedule([A, B], kernel)

        A_, B_ = s.to([A, B], target.xcel)
        ret_ = s.to(kernel.ret, target.host)

        kernel = s.subgraph(inputs=[A_, B_], outputs=[ret_])
        kernel.duplicate(factor=2)
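(If I read the proposed API correctly, `s.subgraph` extracts everything between the device-scoped inputs `A_`/`B_` and the output `ret_` as one device-side subgraph, and `duplicate(factor=2)` instantiates two copies of that compute unit, matching the data-parallel workload splitting discussed above.)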
seanlatias commented 4 years ago

Can you fix the conflicts first? Also, for each of the new features, is there a corresponding test case? If so, can you point them out so they're easier to find? For example, write the test case name after each feature.

seanlatias commented 4 years ago

Oh, is this not ready for review? I looked at the wrong PR.

hecmay commented 4 years ago

Features to be added in the next PR:

```
// First CONV Layer
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    pipelined (xi, 0, K) {
      for (yi, 0, L) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) {
            local += Input[...] * kernel[...]
          }
        }
        O[...] = local
        channel[yo, xo].write(local)
      }
    }
  }
}

// Next CONV Layer
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    // shift register as reuse line buffer
    buffer[yo, xo].insert(channel[yo, xo].read())
    pipelined (xi, 0, K) {
      for (yi, 0, L) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) {
            local += buffer[...] * kernel[...]
          }
        }
        O[...] = local
      }
    }
  }
}
```
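(For context, a rough HeteroCL-level sketch of how such a pair of streamed layers might be written, assuming the stage-to-stage `s.to()` form from this PR; shapes, names, and tiling factors are hypothetical:)

```python
import heterocl as hcl

hcl.init()
I = hcl.placeholder((32, 32), "I")
W1 = hcl.placeholder((3, 3), "W1")
W2 = hcl.placeholder((3, 3), "W2")

def kernel(I, W1, W2):
    ry1 = hcl.reduce_axis(0, 3, "ry1")
    rx1 = hcl.reduce_axis(0, 3, "rx1")
    conv1 = hcl.compute((30, 30), lambda y, x: hcl.sum(
        I[y + ry1, x + rx1] * W1[ry1, rx1], axis=[ry1, rx1]), "conv1")
    ry2 = hcl.reduce_axis(0, 3, "ry2")
    rx2 = hcl.reduce_axis(0, 3, "rx2")
    return hcl.compute((28, 28), lambda y, x: hcl.sum(
        conv1[y + ry2, x + rx2] * W2[ry2, rx2], axis=[ry2, rx2]), "conv2")

s = hcl.create_schedule([I, W1, W2], kernel)

# Tile the first layer so the outer yo/xo loops unroll into a PE array.
yo, xo, yi, xi = s[kernel.conv1].tile(
    kernel.conv1.axis[0], kernel.conv1.axis[1], 6, 6)
s[kernel.conv1].unroll(yo)
s[kernel.conv1].unroll(xo)
s[kernel.conv1].pipeline(xi)

# Stream conv1's output directly into conv2 through FIFO channels
# instead of a full intermediate buffer.
s.to(kernel.conv1, s[kernel.conv2])
```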
hecmay commented 4 years ago

@seanlatias Updated the description. For some of the minor changes (e.g. the HLS reporting function and multi-dimensional tensors), I did not put links to the test cases here.

seanlatias commented 4 years ago

The second one is still unclear to me. In your example, both yo and xo should be parallelized, right? For the second conv layer, are yo and xo also parallelized? Is the shift register also inferred automatically? When will it be shifted? I suppose what you are doing is simply performing tiling, which results in multiple PEs for the first conv layer? How about the second conv? Do we have a single PE or multiple PEs?

hecmay commented 4 years ago

> The second one is still unclear to me. In your example, both yo and xo should be parallelized, right? For the second conv layer, are yo and xo also parallelized? Is the shift register also inferred automatically? When will it be shifted? I suppose what you are doing is simply performing tiling, which results in multiple PEs for the first conv layer? How about the second conv? Do we have a single PE or multiple PEs?

Yes, you are right. yo and xo should run in parallel in both layers. I suppose this can be realized by inserting a dataflow pragma on top of these two layers and unrolling the xo/yo axes?

The second layer should also be tiled into multiple PEs. I plan to use the `reuse_at` primitive to generate the shift registers. This is not fully realized yet; I only have a simple example to showcase the streaming arrays.
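(A minimal sketch of the `reuse_at` flavor mentioned here, on a single 3x3 window; names and shapes are hypothetical:)

```python
import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def kernel(A):
    ry = hcl.reduce_axis(0, 3, "ry")
    rx = hcl.reduce_axis(0, 3, "rx")
    return hcl.compute((8, 8), lambda y, x: hcl.sum(
        A[y + ry, x + rx], axis=[ry, rx]), "B")

s = hcl.create_schedule([A], kernel)

# reuse_at infers the reuse pattern along the given axis and generates the
# line buffer / shift registers, so each input pixel is read only once.
LB = s.reuse_at(A, s[kernel.B], kernel.B.axis[0], "LB")   # line buffer over y
WB = s.reuse_at(LB, s[kernel.B], kernel.B.axis[1], "WB")  # window buffer over x
```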

hecmay commented 4 years ago

Most changes have been addressed. I think we need to consider the Copy and ZeroCopy cases separately; I can put that into another PR.

seanlatias commented 4 years ago

@Hecmay, can you also add a test for #240? Thanks.

hecmay commented 4 years ago

Sure. Here is the test case: https://github.com/Hecmay/heterocl/blob/stream_to/tests/test_schedule_stream.py#L531

The allocate statement for the partitioned stage is not optimized away, but it does not matter that much. I suppose there might be something wrong with the `lift_alloc_attr` pass.