Closed hecmay closed 3 years ago
@Hecmay with this PR, do we still require the user to specify .to() by default?
@zhangzhiru The answer is No. The whole program will be offloaded to FPGA if no data placement is specified.
@paldebjit Fixed all the issues you mentioned about AOCL CodeGen (removed __global for const, enforced ap_int/ap_uint dtypes to have bitwidths that are powers of 2, and fixed the header issue; the tutorial example should be runnable now).
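The power-of-2 bitwidth enforcement can be sketched in a few lines; this rounding helper is a hypothetical illustration, not the actual AOCL CodeGen code:

```python
def round_up_pow2(width):
    """Round a bitwidth up to the next power of 2 (minimum 1)."""
    if width < 1:
        raise ValueError("bitwidth must be positive")
    p = 1
    while p < width:
        p *= 2
    return p

# Under this rule an ap_int<12> would be emitted as ap_int<16>.
```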
@chhzh123 Sorry for the long wait. The subgraph() API is available now. Please check this example: https://github.com/Hecmay/heterocl/blob/fix/tests/test_schedule_stream.py#L584
The subgraph() API returns a list of TVM stages extracted from the CDFG. You can add attribute statements into the body of these stages, and let the VHLS codegen print "#pragma HLS dataflow" when it sees the specific attribute statement.
Also some other improvements:
- You do not have to specify .to() explicitly to offload the program to the device. I added an auto-placement function that automatically offloads the whole kernel to the FPGA if no .to() API is found in the program.
- There are no restrictions on the order in which optimization primitives are applied. E.g., you can partition a tensor and later move it to the device, or the other way around. HeteroCL will keep the information and generate a partitioned tensor on HW in both cases.
- The previously auto-generated on-chip buffer has been removed. Now we use DMA as the default data movement option. Host-to-device streaming is also supported, but it is only available on certain devices (e.g., Alveo U250 or A10 with host-device pipe support).
Great! I'll have a check. Thanks!
@Hecmay can you clean up the code first by removing the unnecessary comments?
@seanlatias There are some unnecessary files, which I will remove soon. What kinds of comments do you think are unnecessary, for example? I checked the code again and found most of them useful to me.
Like Line 406 in tvm/src/schedule/schedule_reorder.cc
https://github.com/Hecmay/heterocl/blob/fix/tvm/src/schedule/schedule_reorder.cc#L406. I did not find any unnecessary comments close to line 406.
604.
It seems the fixed-point data type cannot be generated correctly.
dtype = hcl.Fixed(16, 12)
A = hcl.placeholder((10, 32), "A", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args: A[args] + 1, "B", dtype=dtype)
    return B
In host code, the array declaration is
auto A = new ap_int<16>[10][32];
auto B = new ap_int<16>[10][32];
while in kernel code, the function signature is
test(ap_fixed<16, 4> B[10][32], ap_fixed<16, 4> A[10][32])
causing a simulation error.
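For context, hcl.Fixed(16, 12) means 16 total bits with 12 fractional bits, so both the host and kernel code should declare ap_fixed<16, 4>. A hypothetical helper showing the expected type mapping (not the actual codegen code):

```python
def ap_fixed_type(total_bits, frac_bits, signed=True):
    """Map a fixed-point dtype (total bits, fractional bits) to an HLS
    arbitrary-precision type string: ap_fixed<W, I>, where W is the total
    bitwidth and I is the number of integer bits (W - fractional bits)."""
    int_bits = total_bits - frac_bits
    base = "ap_fixed" if signed else "ap_ufixed"
    return f"{base}<{total_bits}, {int_bits}>"
```

Under this mapping, emitting ap_int<16> on the host side drops the fractional interpretation entirely, which explains the csim mismatch.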
Another issue that may not strictly relate to this PR: it looks like variable names use different generation logic in different parts, and the ports in the INTERFACE pragma use variable names that do not correspond to the names in the function signature. For example,
dtype = hcl.Float()
A = hcl.placeholder((10, 32), "A.1", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args: A[args] + 1, "B.1", dtype=dtype)
    return B
The periods (.) in the names of A and B have been replaced with underscores (_) in the function signature,
test(float B_1[10][32], float A_1[10][32])
but the interface pragma still uses A.1 and B.1 as the port names:
#pragma HLS INTERFACE m_axi port=A.1 offset=slave bundle=gmem0
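The signature-side renaming presumably legalizes identifiers roughly like the sketch below; the fix would be to apply the same legalization to the port names in the INTERFACE pragma (hypothetical helper, not the actual codegen):

```python
import re

def legalize_name(name):
    """Replace every character that is invalid in a C identifier with '_',
    so tensor names like 'A.1' become valid C/HLS identifiers."""
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)

# Both the signature and the pragma should then agree on 'A_1' / 'B_1'.
```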
Is consecutive data streaming prohibited now? Something like
s.to(kernel.B, s[kernel.C])
s.to(kernel.C, s[kernel.D])
causes the following error.
heterocl.tvm._ffi.base.TVMError: [12:48:13] src/pass/stream_inference.cc:290: Check failed: attr_index * index_array.back() > 0 Tensor C cannot be read and written at the same time
Moreover, for the simple add example
dtype = hcl.Float()
A = hcl.placeholder((10, 32), "A", dtype=dtype)
def kernel(A):
    B = hcl.compute(A.shape, lambda *args: A[args] + 1, "B", dtype=dtype)
    return B
it's unable to stream data from A to B.
If you run s.to(A, target.xcel, mode=hcl.IO.Stream), then tensor A should be streamed to its consumer (i.e., tensor B) automatically. I did not test this case since we do not have any HW support for host-to-device streaming now. Will add a new test case for it.
This s.to(A, target.xcel) does not look right. I will fix this. Should be a simple fix.
You mean our FPGA board does not support host-to-device streaming, and all these parameters should be stored on-chip first?
All the parameters are DMAed from the host to the off-chip DDR of the FPGA PAC (they can also be stored in other off-chip storage media like HBM and PLRAM). You can create an on-chip buffer if the parameters are not too large, or just access data from the off-chip buffer (this can be expensive, so you may want to coalesce the memory accesses, e.g., increase the number of elements per IO access by packing more elements into a struct).
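The packing suggestion can be sketched as follows; pack_words is a hypothetical host-side helper that merges narrow elements into wider words so each IO access carries more elements:

```python
def pack_words(values, elem_bits=32, elems_per_word=4):
    """Pack fixed-width elements into wider words (e.g. 4 x 32-bit ints
    into one 128-bit word) to cut the number of off-chip IO accesses."""
    words = []
    mask = (1 << elem_bits) - 1
    for i in range(0, len(values), elems_per_word):
        word = 0
        for j, v in enumerate(values[i:i + elems_per_word]):
            word |= (v & mask) << (j * elem_bits)
        words.append(word)
    return words
```

On the HLS side this corresponds to reading a wide struct (or ap_uint<128>) per access and unpacking it on-chip.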
Our U280 FPGA does not support host-to-device streaming right now. The U250 FPGA should be able to support that, but the FPGA shell (i.e., the QDMA shell) is not configured correctly... I am still coordinating with the IT people to solve the problems.
@chhzh123 Are you saying some of these cases worked fine before, but are now failing due to the new changes in this PR? If so, which ones?
@chhzh123 very helpful clarification.
Data streaming between consecutive stages (A->B->C) should work using hls::stream, but this PR makes it an error.
@Hecmay Sounds like we didn't have a test case to cover this very important and common case?
@chhzh123 Fixed now. Test cases added.
@Hecmay is the code ready to review now?
@Hecmay is the code ready to review now?
Yes. It's ready.
@chhzh123 please help review the VHLS codegen part to see if that indeed meets your need.
Yes, the streaming buffers are generated, but the original buffers are not removed. See the following example.
def test_consecutive():
    A = hcl.placeholder((10,), "A")
    def kernel(A):
        B = hcl.compute(A.shape, lambda i: A[i] + 1, "B")
        C = hcl.compute(B.shape, lambda i: B[i] + 1, "C")
        D = hcl.compute(C.shape, lambda i: C[i] + 1, "D")
        return D
    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csim")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.D, target.host)
    s.to(kernel.B, s[kernel.C])
    s.to(kernel.C, s[kernel.D])
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_C = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_C = hcl.asarray(np_C)
    f(hcl_A, hcl_C)
Only B_pipe_1 and C_pipe_2 are needed, but the B and C buffers are also generated.
void test(bit32 A[10], bit32 D[10]) {
  bit32 B[10]; // useless buffer
  bit32 B_pipe_1[10];
  #pragma HLS stream variable=B_pipe_1 depth=1
  #pragma HLS dataflow
  B_i: for (bit32 i = 0; i < 10; ++i) {
    bit32 B_temp;
    B_temp = (A[i] + 1);
    B_pipe_1[i] = B_temp;
  }
  bit32 C[10]; // useless buffer
  bit32 C_pipe_2[10];
  #pragma HLS stream variable=C_pipe_2 depth=2
  C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
    bit32 B_temp1;
    B_temp1 = B_pipe_1[i1];
    bit32 C_temp;
    C_temp = (B_temp1 + 1);
    C_pipe_2[i1] = C_temp;
  }
  D_i2: for (bit32 i2 = 0; i2 < 10; ++i2) {
    bit32 C_temp1;
    C_temp1 = C_pipe_2[i2];
    D[i2] = (C_temp1 + 1);
  }
}
Moreover, can #pragma HLS stream be used with #pragma HLS array_partition? If yes, they need to be guaranteed to work on the same buffer.
As for the unused buffer: I once had an IR pass to remove all unused allocated buffers. Removing the unused allocate statements may violate some assumptions of the following IR passes, but I will try. I assume the HLS tool should be able to automatically ignore the unused buffer allocations.
I do not think HLS stream and HLS array_partition can be applied to the same buffer. You are right; we should check that before the HLS tool throws an error.
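Such a check could be a simple pass over the pragmas collected per buffer; a hypothetical sketch (check_pragma_conflicts is not an existing HeteroCL function):

```python
def check_pragma_conflicts(pragmas):
    """pragmas: list of (buffer_name, pragma_kind) pairs gathered during
    codegen. Raise before emitting HLS code if the same buffer would get
    both 'stream' and 'array_partition', which the tool cannot honor."""
    kinds = {}
    for buf, kind in pragmas:
        kinds.setdefault(buf, set()).add(kind)
    for buf, ks in kinds.items():
        if {"stream", "array_partition"} <= ks:
            raise ValueError(f"buffer {buf} cannot be both streamed and partitioned")
```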
Just a reminder, no need to fix this bug now: the codegen for complex memory access patterns using data streaming is wrong.
A = hcl.placeholder((10,), "A")
def kernel(A):
    B = hcl.compute(A.shape, lambda i: A[i] + 1, "B")
    C = hcl.compute(B.shape, lambda i: hcl.select(i < 9, B[i] + B[i+1], B[i]), "C")
    return C
s.to([A], target.xcel)
s.to(kernel.C, target.host)
s.to(kernel.B, s[kernel.C])
Here the temporary variable for B[i+1] is not generated:
C_i1: for (bit32 i1 = 0; i1 < 10; ++i1) {
  bit32 B_temp1;
  B_temp1 = B_pipe_1[i1];
  C[i1] = ((bit32)((i1 < 9) ? ((ap_int<33>)(((ap_int<33>)B_temp1) + ((ap_int<33>)B_temp1))) : (((ap_int<33>)B_temp1))));
}
This case should be prohibited, but it now passes HeteroCL's compilation.
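A simple legality check could reject streaming when the consumer reads the producer at more than one offset (e.g. both B[i] and B[i+1]); a hypothetical sketch, not actual HeteroCL code:

```python
def is_fifo_safe(read_offsets):
    """A streamed (FIFO) tensor must be consumed strictly in order, one
    element per iteration. If the consumer reads the producer at more than
    one distinct offset relative to the loop index, FIFO streaming is
    illegal and the .to() request should be rejected."""
    return len(set(read_offsets)) == 1

# B[i] only          -> streamable
# B[i] and B[i+1]    -> must be rejected before codegen
```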
The codegen for complex memory access patterns using data streaming is wrong.
@chhzh123 shall we file a separate bug on this issue?
@seanlatias I ran into an issue when simulating a KernelStmt node on CPU.
This is the problematic systolic array benchmark modified from Niansong's code: https://github.com/Hecmay/heterocl/blob/fix/samples/systolic_array/systolic_array_module.py#L45
Niansong was using ret[x] = pe_0(a, b, x) to update the tensor ret (it's a 1x3 tensor) and it worked fine. I changed it to a void function pe_0(a, b, ret) and updated the values inside the function body. However, after I made the changes, the simulation returns an all-zero result. I tried to print out the tensor ret using hcl.print, and it was not updated at all.
The IR does seem correct. You can run the code to reproduce the issue.
Ok, I'll take a look.
@zzzDavid I updated the IP integration interface to make it more concise: https://github.com/Hecmay/heterocl/blob/fix/hlib/python/hlib/ip/vadd/vadd.py.
The current implementation is very straightforward: we create a stage for the IP, and replace the stage body with the input source code during code generation (https://github.com/Hecmay/heterocl/blob/fix/python/heterocl/tvm/runtime.py#L25)
Thanks, I'll try this out
@yn224 The error is fixed. BTW, you do not have to use .to() explicitly to move the compute to the FPGA. Auto data placement is enabled by default, so the whole program will be automatically mapped to the FPGA.
@chhzh123 Can you please let me know if the dataflow sequential access issue can be fixed after you change the input to NHWC? If the HLS tool still fails to recognize the stages in the dataflow region, I will add a pass to wrap the stages into non-inlined function calls. Example as follows:
void funcA() {
  #pragma HLS inline off
  // ...
}
void funcB() {
  #pragma HLS inline off
  // ...
}
void top() {
  #pragma HLS dataflow
  funcA(...);
  funcB(...);
}
@chhzh123 sorry, I just recalled that you told me the tool still complains that the access pattern is not sequential after moving to NHWC. I guess you can change the programming style to fix it, hopefully.
The reuse buffer generated originally:
for y_reuse ...
  for x_reuse ...
    if (y_reuse > k)
      if (x_reuse > k)
        output[y_reuse-k][x_reuse-k] = ...
If we can do it in this way, it should fix the issue:
for y_reuse ...
  for x_reuse ...
    temp = 0
    if (y_reuse > k)
      if (x_reuse > k)
        temp = ...
    output[y_reuse][x_reuse] = temp
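The rewrite above guards only the computation and leaves the store unconditional, so every iteration performs exactly one in-order write, which is what the dataflow checker wants to see. A hypothetical one-dimensional sketch of the pattern in plain Python:

```python
def sequential_writes(n, k, produce):
    """Rewrite guarded stores into unconditional sequential stores:
    compute into a temp under the guard, then always write the temp,
    so the output is written exactly once per iteration, in order."""
    out = []
    for i in range(n):
        temp = 0
        if i > k:
            temp = produce(i)  # guarded computation only
        out.append(temp)       # unconditional, sequential store
    return out
```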
@seanlatias Tested the new version with burst read and write. We are able to generate the code you mentioned before. Our servers are super sluggish now, so I cannot get the synthesis report yet.
I did my best to fix the subgraph extraction algorithm. It should be robust enough, and programmers should be able to map the whole DFG to FPGA in every case (e.g., if the DFG has multiple disconnected subgraphs). The subgraph offloading still has some limitations; we can discuss how to fix them in the future. The main challenge is still how to restore a complete dataflow graph hierarchy from the flattened graph.
@Hecmay I remember you changed hls::stream to multi-dimensional arrays in this PR? Why did you do that?
@zhangzhiru hls::stream is functionally the same as a multi-dimensional array with #pragma HLS stream. From what I observed, the latter is more stable in synthesis. Why do we want to use hls::stream instead?
When is the support of hls::stream less stable than array streaming? Do you have a test case? It's supposed to be the other way around, unless we generate incorrect code.
I do not have a test case right now. I just remember that I got some errors from Vivado saying that the synthesized RTL code is incorrect when testing hls::stream. It was quite a long time ago (maybe that has been fixed in the new version).
We can easily switch to hls::stream if there is any limitation with the multi-dim array and we really need hls::stream.
@seanlatias Here is the HLS report of KMeans after adding the burst read and write loops (Loop 1 and 3 in the table). The latency is much improved.
+ Latency:
* Summary:
+---------+---------+----------+----------+--------+--------+---------+
| Latency (cycles) | Latency (absolute) | Interval | Pipeline|
| min | max | min | max | min | max | Type |
+---------+---------+----------+----------+--------+--------+---------+
| 481178| 481178| 1.604 ms | 1.604 ms | 481178| 481178| none |
+---------+---------+----------+----------+--------+--------+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-------------+---------+---------+----------+-----------+-----------+-------+----------+
| | Latency (cycles) | Iteration| Initiation Interval | Trip | |
| Loop Name | min | max | Latency | achieved | target | Count | Pipelined|
+-------------+---------+---------+----------+-----------+-----------+-------+----------+
|- Loop 1 | 10248| 10248| 10| 1| 1| 10240| yes |
|- Loop 2 | 470600| 470600| 2353| -| -| 200| no |
| + Loop 2.1 | 334| 334| 16| 1| 1| 320| yes |
| + Loop 2.2 | 321| 321| 3| 1| 1| 320| yes |
| + Loop 2.3 | 1168| 1168| 73| -| -| 16| no |
|- Loop 3 | 321| 321| 3| 1| 1| 320| yes |
+-------------+---------+---------+----------+-----------+-----------+-------+----------+
I also got some response about the interface stuff in Vitis -- it seems that newer versions of Vitis do not require the interface pragma. However, it is required for Vitis 2019.2 (the version we have on our server).
@Hecmay can you also try the results without burst mode?
@seanlatias The result without burst rd/wr.
================================================================
== Performance Estimates
================================================================
+ Timing:
* Summary:
+--------+---------+----------+------------+
| Clock | Target | Estimated| Uncertainty|
+--------+---------+----------+------------+
|ap_clk | 3.33 ns | 2.433 ns | 0.90 ns |
+--------+---------+----------+------------+
+ Latency:
* Summary:
+----------+----------+-----------+-----------+----------+----------+---------+
| Latency (cycles) | Latency (absolute) | Interval | Pipeline|
| min | max | min | max | min | max | Type |
+----------+----------+-----------+-----------+----------+----------+---------+
| 36082801| 36082801| 0.120 sec | 0.120 sec | 36082801| 36082801| none |
+----------+----------+-----------+-----------+----------+----------+---------+
+ Detail:
* Instance:
N/A
* Loop:
+-------------+----------+----------+----------+-----------+-----------+------+----------+
| | Latency (cycles) | Iteration| Initiation Interval | Trip | |
| Loop Name | min | max | Latency | achieved | target | Count| Pipelined|
+-------------+----------+----------+----------+-----------+-----------+------+----------+
|- Loop 1 | 36082800| 36082800| 180414| -| -| 200| no |
| + Loop 1.1 | 168960| 168960| 528| -| -| 320| no |
| + Loop 1.2 | 10248| 10248| 41| 32| 1| 320| yes |
| + Loop 1.3 | 1184| 1184| 74| -| -| 16| no |
+-------------+----------+----------+----------+-----------+-----------+------+----------+
So I guess the burst mode cannot be inferred automatically in this case.
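For reference, comparing the two reports above, burst read/write reduces total latency from 36082801 cycles to 481178 cycles:

```python
no_burst = 36082801  # total latency in cycles, without burst rd/wr
burst = 481178       # total latency in cycles, with burst rd/wr
speedup = no_burst / burst
print(f"speedup: {speedup:.1f}x")  # roughly 75x
```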
BTW, can you add s[main_loop.update_mean].unroll(0) and then run it on the board? Let's compare the speedup with/without the optimization.
@Hecmay Did you add back the interface for Vivado HLS? #261 is not only a Vitis problem, but also works for Vivado HLS.
You mean Vivado HLS also requires the port width to be a multiple of 8? No. Vivado HLS does not require the interface pragmas. It is only required in Vitis 2019.2.
I mean our codegen will generate #pragma HLS INTERFACE for Vivado HLS now, which requires the bitwidth of the input argument to be a multiple of 8.
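Padding a port width up to the next multiple of 8 is a one-liner; a hypothetical helper that such a check or auto-fix could use (not actual HeteroCL code):

```python
def pad_to_bytes(width):
    """Round a port bitwidth up to the next multiple of 8, matching the
    byte-aligned width requirement of m_axi interface ports."""
    return (width + 7) // 8 * 8
```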
- Fixed reuse_at (by adding an IR pass to fix the buffer binding issues). Should fix #264, fix #230, fix #219 and fix #154
- Added s.subgraph(). This is needed for dataflow support #245
- Added test cases in test_runtime_build and test_schedule_stream
- Used hls::stream to implement FIFOs in the Vivado HLS backend (NB: this is less stable than a pragma-annotated multi-dimensional array, based on what we observed in Vivado HLS 2019.2). Should fix #286

Other new features
- HCL_DEBUG_LEVEL(level) MACRO (enabled when HCL_DEBUG_ON is set)
- Data placement to on-chip memory (e.g., s.to(tensor, p.xcel.BRAM)). Should fix #281

Fixed bugs
- Error in python: double free or corruption (!prev)
- Illegal variable names in generated code (e.g., loop.1 to loop_1)