Multi-device support (to allow users to split the workload across multiple FPGA boards).
This is cool. But what's the use case, and how do we test such a feature?
Enhance the graph partitioning algorithm for op scheduling. For example, the algorithm assumes that the user specifies all the data movement needed to form a subgraph in the CFG, which is not always true.
The graph partitioning is used to support .to()? Can you open a separate issue to list the potential bugs due to the assumptions we are making?
The use case would be, for example, to split a large workload or to duplicate compute units for data parallelism. For now we only test the correctness of the generated IR; the generated program can be tested on an AWS F1 instance with multiple FPGA cards.
To be precise, I should call it a subgraph extraction algorithm; it's not the classic graph partitioning algorithm (where the whole graph is partitioned into balanced subgraphs and communication is minimized). Sure, I can create an issue.
Added a simple resource reporting function. Example output:
[21:33:10] Generating harness files ...
[21:33:10] Compiling the program ...
[01:33:33] Resource consumption
[--------] Clock Period : 5.806
[--------] Best Latency : 2303
[--------] Average Latency : 2303
[--------] Worst Latency : 2303
[--------] BRAM : 2
[--------] FF : 154
[--------] LUT : 425
[--------] DSP48E : 0
[21:33:33] Execution complete
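For reference, a rough sketch of a flow that would exercise this report; the exact point at which the summary is printed (build vs. execution) is an assumption here:

import heterocl as hcl
import numpy as np

hcl.init()
A = hcl.placeholder((10, 32), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")

target = hcl.platform.aws_f1
s = hcl.create_schedule([A], kernel)
s.to(A, target.xcel)
s.to(kernel.B, target.host)
f = hcl.build(s, target)

# Running the generated function drives the compilation flow; the
# resource/latency summary above is printed as part of that run (assumption).
hcl_A = hcl.asarray(np.random.randint(10, size=(10, 32)))
hcl_B = hcl.asarray(np.zeros((10, 32)))
f(hcl_A, hcl_B)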
shmop is not that stable and may cause a SegFault. An example of kernel duplication: @Huyuwei
import heterocl as hcl

def test_merge_kernel_stages():
    hcl.init()
    A = hcl.placeholder((10, 32), "A")
    B = hcl.placeholder((10, 32), "B")

    def kernel(A, B):
        C = hcl.compute(A.shape, lambda i, j: 0, "C")
        hcl.update(C, lambda i, j: A[i, j] + 1, "s1")
        hcl.update(C, lambda i, j: B[i, j] * 2, "s2")
        return hcl.compute(C.shape, lambda *args: C[args] + 3, "ret")

    target = hcl.platform.aws_f1
    s = hcl.create_schedule([A, B], kernel)
    A_, B_ = s.to([A, B], target.xcel)
    ret_ = s.to(kernel.ret, target.host)
    kernel = s.subgraph(inputs=[A_, B_], outputs=[ret_])
    kernel.duplicate(factor=2)
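For context: subgraph here extracts the stages between the inferred .to boundaries into a single device kernel, and duplicate(factor=2) is meant to instantiate two copies of that kernel, which is the data-parallel "duplicate compute units" use case mentioned above.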
Can you fix the conflicts first? Also, for each of the new features, is there a corresponding test case? If so, can you point it out so it's easier to check? For example, after each feature, write the test case name.
Oh, this is not ready to review? I looked at the wrong PR.
Features to be added in the next PR:
p = hcl.platform.custom(config)
s = hcl.create_schedule([A, B], kernel)
s.to(A, p.xcel[1])
s.to(B, p.xcel[2])
// First CONV Layer
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    pipelined (xi, 0, K) {
      for (yi, 0, L) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) {
            local += Input[...] * kernel[...]
          }
        }
        O[...] = local
        channel[yo, xo].write(local)
      }
    }
  }
}
// Next CONV Layer
unrolled (yo, 0, N) {
  unrolled (xo, 0, M) {
    // shift register as reuse line buffer
    buffer[yo, xo].insert(channel[yo, xo].read())
    pipelined (xi, 0, K) {
      for (yi, 0, L) {
        local = 0
        for (kx, 0, K1) {
          for (ky, 0, K2) {
            local += buffer[...] * kernel[...]
          }
        }
        O[...] = local
      }
    }
  }
}
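Presumably this structure would be produced from a schedule roughly like the sketch below; the concrete shapes, stage names, and the exact .to form for stage-to-stage channels are assumptions for illustration:

import heterocl as hcl

hcl.init()
N, K = 8, 3
I = hcl.placeholder((N + 2 * (K - 1), N + 2 * (K - 1)), "I")
W1 = hcl.placeholder((K, K), "W1")
W2 = hcl.placeholder((K, K), "W2")

def kernel(I, W1, W2):
    r1y = hcl.reduce_axis(0, K, "r1y")
    r1x = hcl.reduce_axis(0, K, "r1x")
    conv1 = hcl.compute((N + K - 1, N + K - 1),
        lambda y, x: hcl.sum(I[y + r1y, x + r1x] * W1[r1y, r1x], axis=[r1y, r1x]),
        "conv1")
    r2y = hcl.reduce_axis(0, K, "r2y")
    r2x = hcl.reduce_axis(0, K, "r2x")
    conv2 = hcl.compute((N, N),
        lambda y, x: hcl.sum(conv1[y + r2y, x + r2x] * W2[r2y, r2x], axis=[r2y, r2x]),
        "conv2")
    return conv2

s = hcl.create_schedule([I, W1, W2], kernel)
# Assumed form for a producer-to-consumer FIFO channel between the two layers;
# the exact .to signature for stage-to-stage streaming may differ.
s.to(kernel.conv1, kernel.conv2)
# Unroll the outer loops so each iteration maps to its own PE.
s[kernel.conv1].unroll(kernel.conv1.axis[0])
s[kernel.conv2].unroll(kernel.conv2.axis[0])
print(hcl.lower(s))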
@seanlatias Updated the description. For some of the minor changes (e.g., the HLS reporting function, multi-dimensional tensors), I did not put links to the test cases here.
The second one is still unclear to me. In your example, both yo and xo should be parallelized, right? For the second conv layer, are yo and xo also parallelized? The shift register is also inferred automatically? When will it be shifted? I suppose what you are doing is simply performing tiling, which results in multiple PEs for the first conv layer? How about the second conv? Do we have a single PE or multiple PEs?
Yes, you are right. yo and xo should run in parallel for both layers. I suppose this can be realized by inserting a dataflow pragma on top of these two layers and unrolling the xo/yo axes?
The second layer should also be tiled into multiple PEs. I plan to use the reuse_at primitive to generate the shift registers. This is not fully realized yet; I only have a simple example to showcase the streaming arrays.
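For reference, a minimal sketch of reuse_at on a simple 1D window (the standard reuse-buffer usage, not yet the full shift-register generation for the second conv layer):

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def blur(A):
    # 3-point horizontal window: consecutive x iterations reuse two of the three reads
    return hcl.compute((10, 8), lambda y, x: A[y, x] + A[y, x + 1] + A[y, x + 2], "B")

s = hcl.create_schedule([A], blur)
# reuse_at inserts a reuse buffer for A along the x axis of stage B
RB = s.reuse_at(A, s[blur.B], blur.B.axis[1], "RB")
print(hcl.lower(s))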
Most changes have been addressed. I think we need to consider the Copy and ZeroCopy cases separately. I can put that into another PR.
@Hecmay, can you also add a test for #240? Thanks.
Sure. Here is the test case: https://github.com/Hecmay/heterocl/blob/stream_to/tests/test_schedule_stream.py#L531
The allocate statement for the partitioned stage is not optimized away, but it does not matter that much. I suppose there might be something wrong with the lift_alloc_attr pass.
In this PR, we will introduce the .to primitive. Here is an example of creating channels between systolic array PEs. The .to primitive creates self-loopback streaming channels and passes values between PEs:

s[k].pipeline(k.axis[0])
# self-loopback streaming
s.to(k.localA, kernel.update)
s.to(k.localB, kernel.update)
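In other words, each PE's localA/localB values are, presumably, forwarded through these channels to the neighboring PE within the same update stage rather than re-read from the original buffers, which is what forms the systolic data movement.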