Closed chhzh123 closed 2 years ago
Also support CPU simulation for outlined functions.
```python
import heterocl as hcl
import numpy as np

def test_outline_cpu():
    A = hcl.placeholder((32, 32), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")
        C = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "C")
        D = hcl.compute(A.shape, lambda i, j: B[i, j] + C[i, j], "D")
        return D

    s = hcl.create_schedule([A], kernel)
    func_B_C, func_D = s.outline([s[kernel.B], s[kernel.C]], [s[kernel.D]])
    mod = hcl.build(s, top=[func_B_C, func_D], target=None)
    # Allocate four independent buffers (a single `[np.zeros(...)] * 4`
    # would alias one array four times)
    np_A, np_B, np_C, np_D = [np.zeros((32, 32)) for _ in range(4)]
    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np_B)
    hcl_C = hcl.asarray(np_C)
    hcl_D = hcl.asarray(np_D)
    mod.modules[0](hcl_A, hcl_B, hcl_C)
    mod.modules[1](hcl_B, hcl_C, hcl_D)
    print(hcl_D.asnumpy())
```
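Assuming the two outlined modules behave as in the original kernel (`B = C = A + 1`, `D = B + C`), a plain NumPy reference check for the test above might look like this sketch (the helper names are illustrative, not HeteroCL API):

```python
import numpy as np

def stage_B_C(A):
    # Mirrors the first outlined function: computes both B and C from A
    B = A + 1
    C = A + 1
    return B, C

def stage_D(B, C):
    # Mirrors the second outlined function: D = B + C
    return B + C

A = np.zeros((32, 32), dtype=np.int32)
B, C = stage_B_C(A)
D = stage_D(B, C)
assert (D == 2 * (A + 1)).all()
```

Running the HeteroCL module with zero inputs should therefore produce an all-2 output, which the reference check confirms.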
Added a function-sharing mechanism. Users can call `<stage>.outline(merge=<func>)` to specify which functions they want to merge together, where `<func>` is a function handle returned by a previous `outline()` call.
```python
import heterocl as hcl

def test_outline():
    A = hcl.placeholder((32, 32), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")
        C = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "C")
        D = hcl.compute(A.shape, lambda i, j: B[i, j] + C[i, j], "D")
        return D

    target = hcl.Platform.xilinx_zc706
    target.config(compiler="vivado_hls", mode="debug",
                  project="stages-outline.prj")
    s = hcl.create_schedule([A], kernel)
    s[kernel.B].pipeline(kernel.B.axis[1])
    func_B = s[kernel.B].outline()
    func_C = s[kernel.C].outline(merge=func_B)
    func_D = s[kernel.D].outline()
    print(s.device_module)
    print(hcl.lower(s))
```
At the MLIR level, we attach an attribute to the outline operation, e.g., `hcl.outline(%7) {merge = "Stage_B"}`. Finally, we generate the following code. Note that only two outlined functions are generated, but there are three function calls.
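The function-sharing logic can be sketched in plain Python (hypothetical helper, not the actual HCL-MLIR pass): each outline op either creates a fresh function or, when a `merge` attribute is present, reuses the named function and only emits a new call site.

```python
def outline_stages(stages):
    """Sketch of the function-sharing pass.

    `stages` is a list of (stage_name, merge_target) pairs, where
    merge_target is None or the name of a previously outlined function
    whose body this stage should reuse.
    """
    functions = []  # outlined private function definitions (by name)
    calls = []      # call sites emitted in @top, in order
    for name, merge in stages:
        if merge is None:
            func = f"Stage_{name}"
            functions.append(func)  # emit a new private function
        else:
            func = merge            # reuse the merged-into function
        calls.append(func)          # always emit a call site
    return functions, calls

funcs, calls = outline_stages([("B", None), ("C", "Stage_B"), ("D", None)])
# funcs → ["Stage_B", "Stage_D"]: two definitions
# calls → ["Stage_B", "Stage_B", "Stage_D"]: three call sites
```

This matches the IR below: two private functions, three `call` ops.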
```mlir
module {
  func private @Stage_B(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) attributes {bit, itypes = "s_"} {
    affine.for %arg2 = 0 to 32 {
      affine.for %arg3 = 0 to 32 {
        %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<32x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %1 = arith.addi %0, %c1_i32 : i32
        affine.store %1, %arg1[%arg2, %arg3] {to = "B"} : memref<32x32xi32>
      } {loop_name = "j", pipeline_ii = 1 : i32}
    } {loop_name = "i", stage_name = "B"}
    return
  }
  func private @Stage_D(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>, %arg2: memref<32x32xi32>) attributes {bit, itypes = "___"} {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 32 {
        %0 = affine.load %arg0[%arg3, %arg4] {from = "B"} : memref<32x32xi32>
        %1 = affine.load %arg1[%arg3, %arg4] {from = "C"} : memref<32x32xi32>
        %2 = arith.addi %0, %1 : i32
        affine.store %2, %arg2[%arg3, %arg4] {to = "D"} : memref<32x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "D"}
    return
  }
  func @top(%arg0: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<32x32xi32>
    call @Stage_B(%arg0, %0) : (memref<32x32xi32>, memref<32x32xi32>) -> ()
    %1 = memref.alloc() {name = "C"} : memref<32x32xi32>
    call @Stage_B(%arg0, %1) : (memref<32x32xi32>, memref<32x32xi32>) -> ()
    %2 = memref.alloc() {name = "D"} : memref<32x32xi32>
    call @Stage_D(%0, %1, %2) : (memref<32x32xi32>, memref<32x32xi32>, memref<32x32xi32>) -> ()
    return %2 : memref<32x32xi32>
  }
}
```
Added support for the `param` and `axis` attributes. See this commit.
```mlir
module {
  func @top(%arg0: memref<10x32xi32>) -> memref<10x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "C"} : memref<10x32xi32>
    %1 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %2 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<10x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %13 = arith.addi %12, %c1_i32 : i32
        affine.store %13, %0[%arg1, %arg2] {to = "C"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "C"}
    %3 = hcl.create_stage_handle "C" : !hcl.StageHandle
    %4 = memref.alloc() {name = "D"} : memref<10x32xi32>
    %5 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %6 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %0[%arg1, %arg2] {from = "C"} : memref<10x32xi32>
        %c2_i32 = arith.constant 2 : i32
        %13 = arith.muli %12, %c2_i32 : i32
        affine.store %13, %4[%arg1, %arg2] {to = "D"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "D"}
    %7 = hcl.create_stage_handle "D" : !hcl.StageHandle
    %8 = memref.alloc() {name = "E"} : memref<10x32xi32>
    %9 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %10 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %4[%arg1, %arg2] {from = "D"} : memref<10x32xi32>
        %c3_i32 = arith.constant 3 : i32
        %13 = arith.muli %12, %c3_i32 : i32
        affine.store %13, %8[%arg1, %arg2] {to = "E"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "E"}
    %11 = hcl.create_stage_handle "E" : !hcl.StageHandle
    hcl.outline (%3) {param = ["i", "j"]}
    hcl.outline (%7) {param = ["i", "j"], merge = "Stage_C"}
    hcl.outline (%11) {axis = "j"}
    return %8 : memref<10x32xi32>
  }
}
```
The generated code is shown below.
```mlir
module {
  func private @Stage_C(%arg0: memref<10x32xi32>, %arg1: memref<10x32xi32>, %arg2: index, %arg3: index) attributes {bit, itypes = "s_"} {
    affine.for %arg4 = 0 to %arg2 {
      affine.for %arg5 = 0 to %arg3 {
        %0 = affine.load %arg0[%arg4, %arg5] {from = "A"} : memref<10x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %1 = arith.addi %0, %c1_i32 : i32
        affine.store %1, %arg1[%arg4, %arg5] {to = "C"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "C"}
    return
  }
  func private @Stage_E(%arg0: memref<10x32xi32>, %arg1: memref<10x32xi32>, %arg2: index) attributes {bit, itypes = "__"} {
    affine.for %arg3 = 0 to 32 {
      %0 = affine.load %arg0[%arg2, %arg3] {from = "D"} : memref<10x32xi32>
      %c3_i32 = arith.constant 3 : i32
      %1 = arith.muli %0, %c3_i32 : i32
      affine.store %1, %arg1[%arg2, %arg3] {to = "E"} : memref<10x32xi32>
    } {loop_name = "j"}
    return
  }
  func @top(%arg0: memref<10x32xi32>) -> memref<10x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "C"} : memref<10x32xi32>
    %c10 = arith.constant 10 : index
    %c32 = arith.constant 32 : index
    call @Stage_C(%arg0, %0, %c10, %c32) : (memref<10x32xi32>, memref<10x32xi32>, index, index) -> ()
    %1 = memref.alloc() {name = "D"} : memref<10x32xi32>
    %c10_0 = arith.constant 10 : index
    %c32_1 = arith.constant 32 : index
    call @Stage_C(%0, %1, %c10_0, %c32_1) : (memref<10x32xi32>, memref<10x32xi32>, index, index) -> ()
    %2 = memref.alloc() {name = "E"} : memref<10x32xi32>
    affine.for %arg1 = 0 to 10 {
      call @Stage_E(%1, %2, %arg1) : (memref<10x32xi32>, memref<10x32xi32>, index) -> ()
    } {loop_name = "i", stage_name = "E"}
    return %2 : memref<10x32xi32>
  }
}
```
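The effect of the two attributes can be mimicked in NumPy (a behavioral sketch; the function names mirror the IR, everything else is illustrative). With `param`, the loop bounds become `index` arguments of `Stage_C`; with `axis = "j"`, only the inner `j` loop is outlined into `Stage_E` and the outer `i` loop stays in `top`, calling `Stage_E` once per row. Note also that, per the generated code above, the merged stage `D` reuses `Stage_C`'s body (`+ 1`) rather than its own original body (`* 2`), since `merge` forces both call sites onto one function.

```python
import numpy as np

def Stage_C(src, dst, ub_i, ub_j):
    # Loop bounds passed as arguments: the `param = ["i", "j"]` attribute
    for i in range(ub_i):
        for j in range(ub_j):
            dst[i, j] = src[i, j] + 1

def Stage_E(src, dst, i):
    # Only the inner j loop is outlined (`axis = "j"`); the outer
    # loop index i comes in as an argument
    for j in range(32):
        dst[i, j] = src[i, j] * 3

def top(A):
    C = np.empty_like(A)
    Stage_C(A, C, 10, 32)
    D = np.empty_like(A)
    Stage_C(C, D, 10, 32)   # merged: stage D reuses @Stage_C's body
    E = np.empty_like(A)
    for i in range(10):      # outer i loop remains in top
        Stage_E(D, E, i)
    return E
```

For an all-zero input, this computes C = 1, D = 2, and E = 6 elementwise, matching the generated IR's dataflow.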
As application designs become larger and larger, it is hard to optimize, compile, and debug them efficiently. For example, a typical ResNet-20 model written in HeteroCL may generate thousands of lines of MLIR code with nearly 100 stages and take about 1 hour to synthesize in Vivado HLS. Currently, HeteroCL only supports monolithic compilation, meaning that every time users change the design even slightly (e.g., add an optimization primitive to one stage), they need to recompile the whole design. This is very time-consuming and unsuitable for agile development and design space exploration, which require quick feedback from synthesis tools.
To mitigate this problem, we can decompose a large design into small pieces and optimize and compile them separately. The key to this decomposition is the introduction of the `.outline()` primitive. As mentioned in this proposal, the `.outline()` primitive turns a stage or several consecutive stages into a function. Here, we further formalize two use cases of the primitive:

- `<stage>.outline()`: The outline primitive can be directly applied to a stage, which is then automatically outlined as a function.
- `s.outline([<stage_list1>], [<stage_list2>], ...)`: This form groups the stages in each list into a new function, which is very useful for forming a dataflow region.

The example above shows a program with two dataflow paths, where we outline each stage using the `<stage>.outline()` method. The outline primitive is again a customization operation in our HCL-MLIR dialect, so it can be easily applied during the loop transformation pass. In the transformed MLIR code, the three stages are all outlined as functions named `@Stage_X` with well-captured input and output parameters, and the optimizations applied to specific stages are retained (e.g., the affine `#map` generated by the partition primitive).

Since the outlined stages are now independent components, we can actually compile them separately. We make the `.outline()` primitive return a function handle and pass it to `hcl.build`. We also introduce a new parameter called `top` to specify the top-level function(s). By default, `top` is `None`, so the design is compiled as a whole. But if we pass in function handles, those stage functions are treated as different modules, and a super module containing those submodules is built. Calling the built `mod` invokes the compilation of `func_B`, `func_C`, and `func_D` simultaneously. Three Vivado HLS processes are created in this case, which yields up to a 3x compilation time reduction in the best case.

The `.outline()` primitive is a top-down approach and can be viewed as a complement to @zzzDavid's hierarchical schedule (a bottom-up methodology). Two points illustrate why a top-down approach is also needed:

- To optimize `Stage_B` separately using a bottom-up method, a user needs to figure out the inputs and outputs of `B`, factor out the code, and create a new schedule for this stage. A similar process is needed for `Stage_C` and `Stage_D`, which is tedious and harms the original codebase.
- Considering that many legacy HeteroCL programs are monolithic designs, the `.outline()` primitive maintains the original functionality without changing the algorithm part while also speeding up the development process.

The basic facility of the `.outline()` primitive has been implemented in commit 06b7b5 in this repo and 14b5f8 in the HeteroCL repo.
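The simultaneous compilation described above can be sketched with Python's `concurrent.futures` (an illustration of the idea only; the actual HeteroCL build runner may work differently). Each outlined function gets its own synthesis job, so wall-clock time approaches that of the slowest stage. Threads suffice here because each job would mostly wait on an external tool process.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(func_name):
    # Placeholder for one Vivado HLS run per outlined function; a real
    # runner would launch the tool in its own project directory and
    # block until synthesis finishes.
    return f"{func_name}: done"

def build_modules(func_names):
    # Launch one synthesis job per outlined function in parallel;
    # Executor.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=len(func_names)) as pool:
        return list(pool.map(synthesize, func_names))

results = build_modules(["func_B", "func_C", "func_D"])
```

With three independent stage functions, this is the structure that gives the 3x best-case reduction mentioned above.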