cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

[API] Data placement and streaming API rewrite #265

Closed (hecmay closed this 3 years ago)

hecmay commented 4 years ago

Other new features

Fixed bugs

zhangzhiru commented 4 years ago

@Hecmay with this PR, do we still require the user to specify .to() by default?

hecmay commented 4 years ago

@zhangzhiru No. The whole program will be offloaded to the FPGA if no data placement is specified.
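For reference, a minimal sketch of the new default, mirroring the test style used later in this thread; with no .to() in the schedule the whole kernel lands on the device:

import heterocl as hcl

A = hcl.placeholder((10,), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda i: A[i] + 1, "B")

target = hcl.platform.zc706
target.config(compile="vivado_hls", mode="csim")
s = hcl.create_schedule([A], kernel)
# No .to() anywhere: auto-placement maps the whole kernel to the FPGA.
f = hcl.build(s, target)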

hecmay commented 4 years ago

@paldebjit Fixed all the issues you mentioned about the AOCL codegen: removed __global for const, enforced ap_int/ap_uint dtypes to have a bitwidth that is a power of 2, and fixed the header issue. The tutorial example should be runnable now.

hecmay commented 4 years ago

@chhzh123 Sorry for the long wait. The subgraph() API is available now. Please check this example: https://github.com/Hecmay/heterocl/blob/fix/tests/test_schedule_stream.py#L584

The subgraph() API returns a list of TVM stages extracted from the CDFG. You can add attribute statements into the body of these stages and let the VHLS codegen print "#pragma HLS dataflow" when it sees the specific attribute statement.

Also, some other improvements:

  • You do not have to specify .to() explicitly to offload the program to the device. I added an auto-placement function that automatically offloads the whole kernel to the FPGA if no .to() call is found in the program.
  • There are no restrictions on the order in which optimization primitives are applied. E.g., you can partition a tensor and later move it to the device, or the other way around (see the sketch after this list); HeteroCL keeps the information and generates a partitioned tensor on hardware in both cases.
  • The previously auto-generated on-chip buffer has been removed. We now use DMA as the default data-movement option. Host-to-device streaming is also supported, but only on certain devices (e.g., an Alveo U250, or an A10 with host-device pipe support).
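A minimal sketch of that order independence, assuming the usual s.partition primitive (the exact keyword arguments are assumptions, not necessarily this PR's API):

s = hcl.create_schedule([A], kernel)
s.partition(kernel.B, dim=1)   # partition the tensor first ...
s.to(kernel.B, target.xcel)    # ... then move it to the device
# ... or apply the same two primitives in the opposite order; both work.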

chhzh123 commented 4 years ago

> The subgraph() API is available now. Please check this example: https://github.com/Hecmay/heterocl/blob/fix/tests/test_schedule_stream.py#L584

Great! I'll have a check. Thanks!

seanlatias commented 4 years ago

@Hecmay can you clean up the code first by removing the unnecessary comments?

hecmay commented 4 years ago

@seanlatias There are some unnecessary files, which I will remove soon. What kind of comments do you think are unnecessary, for example? I checked the code again and found most of them useful to me.

seanlatias commented 4 years ago

> What kind of comments do you think are unnecessary, for example?

Like line 406 in tvm/src/schedule/schedule_reorder.cc.

hecmay commented 4 years ago

> Like line 406 in tvm/src/schedule/schedule_reorder.cc.

https://github.com/Hecmay/heterocl/blob/fix/tvm/src/schedule/schedule_reorder.cc#L406. I did not find any unnecessary comments near line 406.

seanlatias commented 4 years ago

> I did not find any unnecessary comments near line 406.

Line 604.

chhzh123 commented 4 years ago

It seems the fixed-point datatype cannot be generated correctly.

dtype = hcl.Fixed(16,12)
A = hcl.placeholder((10, 32), "A", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args : A[args] + 1, "B", dtype=dtype)
    return B

In the host code, the array declaration is

auto A = new ap_int<16>[10][32];
auto B = new ap_int<16>[10][32];

while in the kernel code, the function signature is

test(ap_fixed<16, 4> B[10][32], ap_fixed<16, 4> A[10][32])

causing a simulation error. (hcl.Fixed(16, 12) means 16 total bits with 12 fraction bits, so both sides should declare ap_fixed<16, 4>, but the host code falls back to ap_int<16>.)

chhzh123 commented 4 years ago

Another issue, which may not strictly relate to this PR: it looks like variable names are generated with different logic in different parts of the code, and the ports in the INTERFACE pragma use variable names that do not correspond to the names in the function signature. For example,

dtype = hcl.Float()
A = hcl.placeholder((10, 32), "A.1", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args : A[args] + 1, "B.1", dtype=dtype)
    return B

The periods (.) in A's and B's names have been replaced with underscores (_),

test(float B_1[10][32], float A_1[10][32])

but the interface still uses A.1 and B.1 as the port name.

#pragma HLS INTERFACE m_axi port=A.1 offset=slave bundle=gmem0
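For illustration, the consistent renaming one would expect the codegen to apply everywhere; sanitize is a hypothetical helper, not HeteroCL API:

def sanitize(name: str) -> str:
    # Apply the same renaming to signatures, buffers, and pragmas alike.
    return name.replace(".", "_")

assert sanitize("A.1") == "A_1"  # the port name should be A_1, not A.1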
chhzh123 commented 4 years ago

Is consecutive data streaming prohibited now? Something like

s.to(kernel.B, s[kernel.C])
s.to(kernel.C, s[kernel.D])

causes the following error.

heterocl.tvm._ffi.base.TVMError: [12:48:13] src/pass/stream_inference.cc:290: Check failed: attr_index * index_array.back() > 0 Tensor C cannot be read and written at the same time
chhzh123 commented 4 years ago

Moreover, for the simple add example

dtype = hcl.Float()
A = hcl.placeholder((10, 32), "A", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args : A[args] + 1, "B", dtype=dtype)
    return B

It's unable to stream data from A to B.

hecmay commented 4 years ago

> It's unable to stream data from A to B.

If you run s.to(A, target.xcel, mode=hcl.IO.Stream), the tensor A should be streamed to its consumer (i.e., tensor B) automatically. I did not test this case, since we do not have any HW support for host-to-device streaming right now. I will add a new test case for it.
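For reference, the schedule would look roughly like this; a sketch assuming the hcl.IO.Stream mode above, which needs a device with host-device pipe support:

s = hcl.create_schedule([A], kernel)
s.to(A, target.xcel, mode=hcl.IO.Stream)  # A streams straight to its consumer B
f = hcl.build(s, target)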

hecmay commented 4 years ago

> If you run s.to(A, target.xcel, mode=hcl.IO.Stream), the tensor A should be streamed to its consumer automatically.

This s.to(A, target.xcel) does not look right. I will fix this. Should be a simple fix.

chhzh123 commented 4 years ago

> I did not test this case, since we do not have any HW support for host-to-device streaming right now.

You mean our FPGA board does not support host-to-device streaming, and all these parameters should be stored on-chip first?

hecmay commented 4 years ago

> You mean our FPGA board does not support host-to-device streaming, and all these parameters should be stored on-chip first?

All the parameters are DMAed from the host to the off-chip DDR of the FPGA PAC (they can also be stored in other off-chip storage media like HBM and PLRAM). You can create an on-chip buffer if the parameters are not too large, or just access the data from the off-chip buffer (this can be expensive, so you may want to coalesce the memory accesses, e.g., increase the number of elements per I/O access by packing more elements into a struct; see the sketch below).

Our U280 FPGA does not support host-to-device streaming right now. The U250 FPGA should be able to support it, but its FPGA shell (i.e., the QDMA shell) is not configured correctly... I am still coordinating with the IT people to solve the problem.
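A hedged sketch of the packing idea: pack four 32-bit elements into one 128-bit word so each off-chip access moves more data. The names, the factor, and the packing scheme are illustrative, not this PR's API:

import heterocl as hcl

factor = 4
A = hcl.placeholder((32,), "A", dtype=hcl.UInt(32))

def kernel(A):
    def pack(i):
        # Build one wide word out of `factor` consecutive narrow elements.
        word = hcl.cast(hcl.UInt(128), A[i * factor])
        for j in range(1, factor):
            word = word | (hcl.cast(hcl.UInt(128), A[i * factor + j]) << (32 * j))
        return word
    return hcl.compute((32 // factor,), pack, "A_packed", dtype=hcl.UInt(128))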

zhangzhiru commented 4 years ago

@chhzh123 Are you saying some of these cases worked fine before, but are now failing due to the new changes in this PR? If so, which ones?

chhzh123 commented 4 years ago

> Are you saying some of these cases worked fine before, but are now failing due to the new changes in this PR? If so, which ones?

Data streaming between consecutive stages (A->B->C) should work using hls::stream, but this PR makes it an error.

zhangzhiru commented 4 years ago

@chhzh123 very helpful clarification.

> Data streaming between consecutive stages (A->B->C) should work using hls::stream, but this PR makes it an error.

@Hecmay Sounds like we didn't have a test case to cover this very important and common case?

hecmay commented 4 years ago

@chhzh123 Fixed now. Test cases added.

seanlatias commented 4 years ago

@Hecmay is the code ready to review now?

hecmay commented 4 years ago

> Is the code ready to review now?

Yes. It's ready.

seanlatias commented 4 years ago

@chhzh123 please help review the VHLS codegen part to see if that indeed meets your need.

chhzh123 commented 4 years ago

Yes, the streaming buffers are generated, but the original buffers are not removed. See the following example.

import heterocl as hcl
import numpy as np

def test_consecutive():
    A = hcl.placeholder((10,), "A")

    def kernel(A):
        B = hcl.compute(A.shape, 
                lambda i: A[i] + 1, "B")
        C = hcl.compute(B.shape,
                lambda i: B[i] + 1, "C")
        D = hcl.compute(C.shape,
                lambda i: C[i] + 1, "D")
        return D

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csim")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.D, target.host)
    s.to(kernel.B, s[kernel.C])
    s.to(kernel.C, s[kernel.D])
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_C = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_C = hcl.asarray(np_C)
    f(hcl_A, hcl_C)

Only B_pipe_1 and C_pipe_2 are needed, while the B and C buffers are also generated.

void test(bit32 A[10], bit32 D[10]) {
    bit32 B[10]; // useless buffer
    bit32 B_pipe_1[10];
    #pragma HLS stream variable=B_pipe_1 depth=1
    #pragma HLS dataflow
    B_i: for (bit32 i = 0; i < 10; ++i) {
      bit32 B_temp;
      B_temp = (A[i] + 1);
      B_pipe_1[i] = B_temp;
    }
    bit32 C[10]; // useless buffer
    bit32 C_pipe_2[10];
    #pragma HLS stream variable=C_pipe_2 depth=2
    C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 B_temp1;
      B_temp1 = B_pipe_1[i1];
      bit32 C_temp;
      C_temp = (B_temp1 + 1);
      C_pipe_2[i1] = C_temp;
    }
    D_i2: for (bit32 i2 = 0; i2 < 10; ++i2) {
      bit32 C_temp1;
      C_temp1 = C_pipe_2[i2];
      D[i2] = (C_temp1 + 1);
    }
  }
chhzh123 commented 4 years ago

Moreover, can #pragma HLS stream be used with #pragma HLS array_partition? If yes, they need to be guaranteed to work on the same buffer.

hecmay commented 4 years ago
  1. As for the unused buffers: I once had an IR pass to remove all unused allocated buffers. Removing the unused allocate statements may violate some assumptions of the subsequent IR passes, but I will try. I assume the HLS tool should be able to ignore the unused buffer allocations automatically.

  2. I do not think HLS stream and HLS array_partition can be applied to the same buffer. You are right; we should check that ourselves before the HLS tool throws an error (see the sketch below).
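A minimal sketch of the combination that should be rejected early, reusing the consecutive-stage test above (s.partition and its keyword arguments are assumptions, not necessarily this PR's API):

s.to(kernel.B, s[kernel.C])    # B is streamed between stages
s.partition(kernel.B, dim=1)   # partitioning the streamed buffer should raise an error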

chhzh123 commented 4 years ago

Just a reminder, no need to fix this bug now: the codegen for complex memory access patterns using data streaming is wrong.

    A = hcl.placeholder((10,), "A")

    def kernel(A):
        B = hcl.compute(A.shape, 
                lambda i: A[i] + 1, "B")
        C = hcl.compute(B.shape,
                lambda i: hcl.select(i < 9, B[i] + B[i+1], B[i]),"C")
        return C

    s.to([A], target.xcel)
    s.to(kernel.C, target.host)
    s.to(kernel.B, s[kernel.C])

Here the temporary variable for B[i+1] is not generated.

C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
      bit32 B_temp1;
      B_temp1 = B_pipe_1[i1];
      C[i1] = ((bit32)((i1 < 9) ? ((ap_int<33>)(((ap_int<33>)B_temp1) + ((ap_int<33>)B_temp1))) : (((ap_int<33>)B_temp1))));
    }

This case should be prohibited (a streamed FIFO is read in order and each element can be consumed only once, so B[i] + B[i+1] cannot both be served from the stream), but it currently passes HeteroCL's compilation.

zhangzhiru commented 4 years ago

> The codegen for complex memory access patterns using data streaming is wrong.

@chhzh123 shall we file a separate bug on this issue?

hecmay commented 4 years ago

@seanlatias I ran into an issue when simulating a KernelStmt node on CPU.

This is the problematic systolic array benchmark modified from Niansong's code: https://github.com/Hecmay/heterocl/blob/fix/samples/systolic_array/systolic_array_module.py#L45

Niansong was using ret[x] = pe_0(a, b, x) to update the tensor ret (a 1x3 tensor) and it worked fine. I changed it to a void function pe_0(a, b, ret) that updates the values inside the function body. However, after I made the changes, the simulation returns an all-zero result. I tried to print out the tensor ret using hcl.print, and it was not updated at all.

The IR does seem correct. You can run the code to reproduce the issue.
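A minimal sketch of the two module styles being compared, assuming hcl.def_ modules as in the sample; the shapes and bodies are illustrative, not Niansong's actual code:

# Return-style module: worked in CPU simulation.
@hcl.def_([(3,), (3,), ()])
def pe_0(a, b, x):
    hcl.return_(a[x] * b[x])

# Void-style module: ret is updated inside the body, but the CPU
# simulation now leaves ret all-zero.
@hcl.def_([(3,), (3,), (3,)])
def pe_0_void(a, b, ret):
    with hcl.for_(0, 3) as i:
        ret[i] = a[i] * b[i]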

seanlatias commented 4 years ago

Ok, I'll take a look.

hecmay commented 4 years ago

@zzzDavid I updated the IP integration interface to make it more concise: https://github.com/Hecmay/heterocl/blob/fix/hlib/python/hlib/ip/vadd/vadd.py.

The current implementation is straightforward: we create a stage for the IP and replace the stage body with the input source code during code generation (https://github.com/Hecmay/heterocl/blob/fix/python/heterocl/tvm/runtime.py#L25).

zzzDavid commented 4 years ago

Thanks, I'll try this out

hecmay commented 4 years ago

@yn224 The error is fixed. BTW, you do not have to use .to() explicitly to move the computation to the FPGA. Auto data placement is enabled by default, so the whole program will be automatically mapped to the FPGA.

hecmay commented 4 years ago

@chhzh123 Can you please let me know whether the dataflow sequential-access issue is fixed after you change the input to NHWC? If the HLS tool still fails to recognize the stages in the dataflow region, I will add a pass to wrap the stages into non-inlined function calls. Example as follows:

void funcA(){
  #pragma HLS inline off
  // ...
}
void funcB(){
  #pragma HLS inline off
  // ...
}
void top() {
  #pragma HLS dataflow
  funcA(...);
  funcB(...);
}
hecmay commented 4 years ago

@chhzh123 sorry, I just recalled that you told me the tool still complains the access pattern is not sequential after moving to NHWC. Hopefully changing the programming style can fix it.

The reuse buffer generated originally:


for y_reuse ...
    for x_reuse ...
        if (y_reuse > k)
            if (x_reuse > k)
                output[y_reuse-k][x_reuse-k] = ...

If we can do it in this way, it should fix the issue:


for y_reuse ...
    for x_reuse ...
        temp = 0
        if (y_reuse > k)
            if (x_reuse > k)
                temp = ...
        output[y_reuse][x_reuse] = temp
hecmay commented 4 years ago

@seanlatias Tested the new version with burst read and write. We are able to generate the code you mentioned before. Our servers are very sluggish right now, so I cannot get any results from the synthesis report yet.

I did my best to fix the subgraph extraction algorithm. It should be robust enough that programmers can map the whole DFG to the FPGA in every case (e.g., when the DFG has multiple disconnected subgraphs). The subgraph offloading still has some limitations; we can discuss how to fix them in the future. The main challenge is still how to restore a complete dataflow-graph hierarchy from the flattened graph.

chhzh123 commented 4 years ago

@Hecmay I remember you changed hls::stream to multi-dimensional arrays in this PR. Why did you do that?

chhzh123 commented 4 years ago

@zhangzhiru

hecmay commented 4 years ago

> I remember you changed hls::stream to multi-dimensional arrays in this PR. Why did you do that?

hls::stream is functionally the same as a multi-dimensional array annotated with #pragma HLS stream. From what I observed, the latter is more stable in synthesis. Why do we want to use hls::stream instead?

zhangzhiru commented 4 years ago

> From what I observed, the latter is more stable in synthesis. Why do we want to use hls::stream instead?

When is the support for hls::stream less stable than array streaming? Do you have a test case? It's supposed to be the other way around, unless we generate incorrect code.

hecmay commented 4 years ago

> When is the support for hls::stream less stable than array streaming? Do you have a test case?

I do not have a test case right now. I just remember getting errors from Vivado saying the synthesized RTL code was incorrect when testing hls::stream. It was quite a long time ago (maybe it has been fixed in newer versions).

We can easily switch to hls::stream if there is any limitation with multi-dimensional arrays and we really need hls::stream.

hecmay commented 4 years ago

@seanlatias Here is the HLS report for KMeans after adding the burst read and write loops (Loops 1 and 3 in the table). The latency is much improved.

+ Latency:
    * Summary:
    +---------+---------+----------+----------+--------+--------+---------+
    |  Latency (cycles) |  Latency (absolute) |     Interval    | Pipeline|
    |   min   |   max   |    min   |    max   |   min  |   max  |   Type  |
    +---------+---------+----------+----------+--------+--------+---------+
    |   481178|   481178| 1.604 ms | 1.604 ms |  481178|  481178|   none  |
    +---------+---------+----------+----------+--------+--------+---------+

    + Detail:
        * Instance:
        N/A

        * Loop:
        +-------------+---------+---------+----------+-----------+-----------+-------+----------+
        |             |  Latency (cycles) | Iteration|  Initiation Interval  |  Trip |          |
        |  Loop Name  |   min   |   max   |  Latency |  achieved |   target  | Count | Pipelined|
        +-------------+---------+---------+----------+-----------+-----------+-------+----------+
        |- Loop 1     |    10248|    10248|        10|          1|          1|  10240|    yes   |
        |- Loop 2     |   470600|   470600|      2353|          -|          -|    200|    no    |
        | + Loop 2.1  |      334|      334|        16|          1|          1|    320|    yes   |
        | + Loop 2.2  |      321|      321|         3|          1|          1|    320|    yes   |
        | + Loop 2.3  |     1168|     1168|        73|          -|          -|     16|    no    |
        |- Loop 3     |      321|      321|         3|          1|          1|    320|    yes   |
        +-------------+---------+---------+----------+-----------+-----------+-------+----------+

I also got some response about the interface issue in Vitis: it seems that newer versions of Vitis do not require the interface pragmas. However, they are required for Vitis 2019.2 (the version we have on our server).

seanlatias commented 4 years ago

@Hecmay can you also try the results without burst mode?

hecmay commented 4 years ago

@seanlatias Here is the result without burst read/write.

================================================================
== Performance Estimates
================================================================
+ Timing:
    * Summary:
    +--------+---------+----------+------------+
    |  Clock |  Target | Estimated| Uncertainty|
    +--------+---------+----------+------------+
    |ap_clk  | 3.33 ns | 2.433 ns |   0.90 ns  |
    +--------+---------+----------+------------+

+ Latency:
    * Summary:
    +----------+----------+-----------+-----------+----------+----------+---------+
    |   Latency (cycles)  |   Latency (absolute)  |       Interval      | Pipeline|
    |    min   |    max   |    min    |    max    |    min   |    max   |   Type  |
    +----------+----------+-----------+-----------+----------+----------+---------+
    |  36082801|  36082801| 0.120 sec | 0.120 sec |  36082801|  36082801|   none  |
    +----------+----------+-----------+-----------+----------+----------+---------+

    + Detail:
        * Instance:
        N/A

        * Loop:
        +-------------+----------+----------+----------+-----------+-----------+------+----------+
        |             |   Latency (cycles)  | Iteration|  Initiation Interval  | Trip |          |
        |  Loop Name  |    min   |    max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-------------+----------+----------+----------+-----------+-----------+------+----------+
        |- Loop 1     |  36082800|  36082800|    180414|          -|          -|   200|    no    |
        | + Loop 1.1  |    168960|    168960|       528|          -|          -|   320|    no    |
        | + Loop 1.2  |     10248|     10248|        41|         32|          1|   320|    yes   |
        | + Loop 1.3  |      1184|      1184|        74|          -|          -|    16|    no    |
        +-------------+----------+----------+----------+-----------+-----------+------+----------+
seanlatias commented 4 years ago

So I guess the burst mode cannot be inferred automatically in this case.

BTW, can you add s[main_loop.update_mean].unroll(0) and then run it on the board? Let's compare the speedup with and without the optimization.

chhzh123 commented 4 years ago

@Hecmay Did you add back the interface for Vivado HLS? #261 is not only a Vitis problem, but also affects Vivado HLS.

hecmay commented 4 years ago

> Did you add back the interface for Vivado HLS? #261 is not only a Vitis problem, but also affects Vivado HLS.

You mean Vivado HLS also requires the port width to be a multiple of 8? No. Vivado HLS does not require the interface pragmas; they are only required in Vitis 2019.2.

chhzh123 commented 4 years ago

> Vivado HLS does not require the interface pragmas; they are only required in Vitis 2019.2.

I mean our codegen will now generate #pragma HLS INTERFACE for Vivado HLS, which requires the bitwidth of the input arguments to be a multiple of 8.