Closed chhzh123 closed 2 years ago
Also support CPU simulation for outlined functions.
```python
import heterocl as hcl
import numpy as np

def test_outline_cpu():
    A = hcl.placeholder((32, 32), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")
        C = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "C")
        D = hcl.compute(A.shape, lambda i, j: B[i, j] + C[i, j], "D")
        return D

    s = hcl.create_schedule([A], kernel)
    func_B_C, func_D = s.outline([s[kernel.B], s[kernel.C]], [s[kernel.D]])
    mod = hcl.build(s, top=[func_B_C, func_D], target=None)
    # Allocate four independent buffers (a single `[np.zeros(...)] * 4`
    # would alias one array four times)
    np_A, np_B, np_C, np_D = [np.zeros((32, 32)) for _ in range(4)]
    hcl_A = hcl.asarray(np_A)
    hcl_B = hcl.asarray(np_B)
    hcl_C = hcl.asarray(np_C)
    hcl_D = hcl.asarray(np_D)
    mod.modules[0](hcl_A, hcl_B, hcl_C)
    mod.modules[1](hcl_B, hcl_C, hcl_D)
    print(hcl_D.asnumpy())
```
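Assuming the two outlined modules behave as in the original kernel (`B = C = A + 1`, `D = B + C`), a plain NumPy reference check for the test above might look like this sketch (the helper names are illustrative, not HeteroCL API):

```python
import numpy as np

def stage_B_C(A):
    # Mirrors the first outlined function: computes both B and C from A
    B = A + 1
    C = A + 1
    return B, C

def stage_D(B, C):
    # Mirrors the second outlined function: D = B + C
    return B + C

A = np.zeros((32, 32), dtype=np.int32)
B, C = stage_B_C(A)
D = stage_D(B, C)
assert (D == 2 * (A + 1)).all()
```

Running the HeteroCL module with zero inputs should therefore produce an all-2 output, which the reference check confirms.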
Added a function-sharing mechanism. Users can call `<stage>.outline(merge=<func>)` to specify which functions they want to merge together, where `<func>` is a function handle returned by a previous `outline()` call.
```python
import heterocl as hcl

def test_outline():
    A = hcl.placeholder((32, 32), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "B")
        C = hcl.compute(A.shape, lambda i, j: A[i, j] + 1, "C")
        D = hcl.compute(A.shape, lambda i, j: B[i, j] + C[i, j], "D")
        return D

    target = hcl.Platform.xilinx_zc706
    target.config(compiler="vivado_hls", mode="debug",
                  project="stages-outline.prj")
    s = hcl.create_schedule([A], kernel)
    s[kernel.B].pipeline(kernel.B.axis[1])
    func_B = s[kernel.B].outline()
    func_C = s[kernel.C].outline(merge=func_B)
    func_D = s[kernel.D].outline()
    print(s.device_module)
    print(hcl.lower(s))
```
At the MLIR level, we attach an attribute to the outline operation, e.g., `hcl.outline(%7) {merge = "Stage_B"}`. Finally, we generate the following code. Note that only two outlined functions are generated, but there are three function calls.
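The function-sharing logic can be sketched in plain Python (hypothetical helper, not the actual HCL-MLIR pass): each outline op either creates a fresh function or, when a `merge` attribute is present, reuses the named function and only emits a new call site.

```python
def outline_stages(stages):
    """Sketch of the function-sharing pass.

    `stages` is a list of (stage_name, merge_target) pairs, where
    merge_target is None or the name of a previously outlined function
    whose body this stage should reuse.
    """
    functions = []  # outlined private function definitions (by name)
    calls = []      # call sites emitted in @top, in order
    for name, merge in stages:
        if merge is None:
            func = f"Stage_{name}"
            functions.append(func)  # emit a new private function
        else:
            func = merge            # reuse the merged-into function
        calls.append(func)          # always emit a call site
    return functions, calls

funcs, calls = outline_stages([("B", None), ("C", "Stage_B"), ("D", None)])
# funcs → ["Stage_B", "Stage_D"]: two definitions
# calls → ["Stage_B", "Stage_B", "Stage_D"]: three call sites
```

This matches the IR below: two private functions, three `call` ops.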
```mlir
module {
  func private @Stage_B(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) attributes {bit, itypes = "s_"} {
    affine.for %arg2 = 0 to 32 {
      affine.for %arg3 = 0 to 32 {
        %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<32x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %1 = arith.addi %0, %c1_i32 : i32
        affine.store %1, %arg1[%arg2, %arg3] {to = "B"} : memref<32x32xi32>
      } {loop_name = "j", pipeline_ii = 1 : i32}
    } {loop_name = "i", stage_name = "B"}
    return
  }
  func private @Stage_D(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>, %arg2: memref<32x32xi32>) attributes {bit, itypes = "___"} {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 32 {
        %0 = affine.load %arg0[%arg3, %arg4] {from = "B"} : memref<32x32xi32>
        %1 = affine.load %arg1[%arg3, %arg4] {from = "C"} : memref<32x32xi32>
        %2 = arith.addi %0, %1 : i32
        affine.store %2, %arg2[%arg3, %arg4] {to = "D"} : memref<32x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "D"}
    return
  }
  func @top(%arg0: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<32x32xi32>
    call @Stage_B(%arg0, %0) : (memref<32x32xi32>, memref<32x32xi32>) -> ()
    %1 = memref.alloc() {name = "C"} : memref<32x32xi32>
    call @Stage_B(%arg0, %1) : (memref<32x32xi32>, memref<32x32xi32>) -> ()
    %2 = memref.alloc() {name = "D"} : memref<32x32xi32>
    call @Stage_D(%0, %1, %2) : (memref<32x32xi32>, memref<32x32xi32>, memref<32x32xi32>) -> ()
    return %2 : memref<32x32xi32>
  }
}
```
Added support for the `param` and `axis` attributes. See this commit.
```mlir
module {
  func @top(%arg0: memref<10x32xi32>) -> memref<10x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "C"} : memref<10x32xi32>
    %1 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %2 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<10x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %13 = arith.addi %12, %c1_i32 : i32
        affine.store %13, %0[%arg1, %arg2] {to = "C"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "C"}
    %3 = hcl.create_stage_handle "C" : !hcl.StageHandle
    %4 = memref.alloc() {name = "D"} : memref<10x32xi32>
    %5 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %6 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %0[%arg1, %arg2] {from = "C"} : memref<10x32xi32>
        %c2_i32 = arith.constant 2 : i32
        %13 = arith.muli %12, %c2_i32 : i32
        affine.store %13, %4[%arg1, %arg2] {to = "D"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "D"}
    %7 = hcl.create_stage_handle "D" : !hcl.StageHandle
    %8 = memref.alloc() {name = "E"} : memref<10x32xi32>
    %9 = hcl.create_loop_handle "i" : !hcl.LoopHandle
    %10 = hcl.create_loop_handle "j" : !hcl.LoopHandle
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 32 {
        %12 = affine.load %4[%arg1, %arg2] {from = "D"} : memref<10x32xi32>
        %c3_i32 = arith.constant 3 : i32
        %13 = arith.muli %12, %c3_i32 : i32
        affine.store %13, %8[%arg1, %arg2] {to = "E"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "E"}
    %11 = hcl.create_stage_handle "E" : !hcl.StageHandle
    hcl.outline (%3) {param = ["i", "j"]}
    hcl.outline (%7) {param = ["i", "j"], merge = "Stage_C"}
    hcl.outline (%11) {axis = "j"}
    return %8 : memref<10x32xi32>
  }
}
```
The generated code is shown below.
```mlir
module {
  func private @Stage_C(%arg0: memref<10x32xi32>, %arg1: memref<10x32xi32>, %arg2: index, %arg3: index) attributes {bit, itypes = "s_"} {
    affine.for %arg4 = 0 to %arg2 {
      affine.for %arg5 = 0 to %arg3 {
        %0 = affine.load %arg0[%arg4, %arg5] {from = "A"} : memref<10x32xi32>
        %c1_i32 = arith.constant 1 : i32
        %1 = arith.addi %0, %c1_i32 : i32
        affine.store %1, %arg1[%arg4, %arg5] {to = "C"} : memref<10x32xi32>
      } {loop_name = "j"}
    } {loop_name = "i", stage_name = "C"}
    return
  }
  func private @Stage_E(%arg0: memref<10x32xi32>, %arg1: memref<10x32xi32>, %arg2: index) attributes {bit, itypes = "__"} {
    affine.for %arg3 = 0 to 32 {
      %0 = affine.load %arg0[%arg2, %arg3] {from = "D"} : memref<10x32xi32>
      %c3_i32 = arith.constant 3 : i32
      %1 = arith.muli %0, %c3_i32 : i32
      affine.store %1, %arg1[%arg2, %arg3] {to = "E"} : memref<10x32xi32>
    } {loop_name = "j"}
    return
  }
  func @top(%arg0: memref<10x32xi32>) -> memref<10x32xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "C"} : memref<10x32xi32>
    %c10 = arith.constant 10 : index
    %c32 = arith.constant 32 : index
    call @Stage_C(%arg0, %0, %c10, %c32) : (memref<10x32xi32>, memref<10x32xi32>, index, index) -> ()
    %1 = memref.alloc() {name = "D"} : memref<10x32xi32>
    %c10_0 = arith.constant 10 : index
    %c32_1 = arith.constant 32 : index
    call @Stage_C(%0, %1, %c10_0, %c32_1) : (memref<10x32xi32>, memref<10x32xi32>, index, index) -> ()
    %2 = memref.alloc() {name = "E"} : memref<10x32xi32>
    affine.for %arg1 = 0 to 10 {
      call @Stage_E(%1, %2, %arg1) : (memref<10x32xi32>, memref<10x32xi32>, index) -> ()
    } {loop_name = "i", stage_name = "E"}
    return %2 : memref<10x32xi32>
  }
}
```
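The effect of the two attributes can be mimicked in NumPy (a behavioral sketch; the function names mirror the IR, everything else is illustrative). With `param`, the loop bounds become `index` arguments of `Stage_C`; with `axis = "j"`, only the inner `j` loop is outlined into `Stage_E` and the outer `i` loop stays in `top`, calling `Stage_E` once per row. Note also that, per the generated code above, the merged stage `D` reuses `Stage_C`'s body (`+ 1`) rather than its own original body (`* 2`), since `merge` forces both call sites onto one function.

```python
import numpy as np

def Stage_C(src, dst, ub_i, ub_j):
    # Loop bounds passed as arguments: the `param = ["i", "j"]` attribute
    for i in range(ub_i):
        for j in range(ub_j):
            dst[i, j] = src[i, j] + 1

def Stage_E(src, dst, i):
    # Only the inner j loop is outlined (`axis = "j"`); the outer
    # loop index i comes in as an argument
    for j in range(32):
        dst[i, j] = src[i, j] * 3

def top(A):
    C = np.empty_like(A)
    Stage_C(A, C, 10, 32)
    D = np.empty_like(A)
    Stage_C(C, D, 10, 32)   # merged: stage D reuses @Stage_C's body
    E = np.empty_like(A)
    for i in range(10):      # outer i loop remains in top
        Stage_E(D, E, i)
    return E
```

For an all-zero input, this computes C = 1, D = 2, and E = 6 elementwise, matching the generated IR's dataflow.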
As application designs become larger and larger, it is hard to optimize, compile, and debug them efficiently. For example, a typical ResNet-20 model written in HeteroCL may generate thousands of lines of MLIR code with nearly 100 stages and take about 1 hour to synthesize in Vivado HLS. Currently, HeteroCL only supports monolithic compilation, meaning that every time users change the design even slightly (e.g., add an optimization primitive to one stage), they need to recompile the whole design. This is very time-consuming and unsuitable for agile development and design space exploration, which require quick feedback from synthesis tools.
To mitigate this problem, we can decompose a large design into small pieces and optimize and compile them separately. The key to this decomposition is the introduction of the `.outline()` primitive. As mentioned in this proposal, the `.outline()` primitive turns a stage or several consecutive stages into a function. Here, we further formalize two use cases of the primitive:

- `<stage>.outline()`: The outline primitive can be directly applied to a stage, which is then automatically outlined as a function.
- `s.outline([<stage_list1>], [<stage_list2>], ...)`: This form groups the stages in each list into a new function, which is very useful for forming a dataflow region.

The example above shows a program with two dataflow paths, where we outline each stage using the `<stage>.outline()` method. The outline primitive is again a customization operation in our HCL-MLIR dialect, so it can be easily applied during the loop transformation pass. In the transformed MLIR code, the three stages are all outlined as functions named `@Stage_X` with well-captured input and output parameters, and the optimizations applied to specific stages are retained (e.g., the affine `#map` generated by the partition primitive).

Since the outlined stages are now independent components, we can actually compile them separately. We make the `.outline()` primitive return a function handle and pass it to `hcl.build`. We also introduce a new parameter called `top` to specify the top-level function(s). By default, `top` is `None`, so the design is compiled as a whole. But if we pass in function handles, those stage functions are treated as different modules, and a super module containing those submodules is built. Calling the built `mod` invokes the compilation of `func_B`, `func_C`, and `func_D` simultaneously. Three Vivado HLS processes are created in this case, which yields up to a 3x compilation time reduction in the best case.

The `.outline()` primitive is a top-down approach and can be viewed as a complement to @zzzDavid's hierarchical schedule (a bottom-up methodology). Two points illustrate why a top-down approach is also needed:

- To optimize `Stage_B` separately using a bottom-up method, a user needs to figure out the inputs and outputs of `B`, factor out the code, and create a new schedule for this stage. A similar process is needed for `Stage_C` and `Stage_D`, which is tedious and harms the original codebase.
- Considering that many legacy HeteroCL programs are monolithic designs, the `.outline()` primitive maintains the original functionality without changing the algorithm part while also speeding up the development process.

The basic facility of the `.outline()` primitive has been implemented in commit 06b7b5 in this repo and 14b5f8 in the HeteroCL repo.
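The simultaneous compilation described above can be sketched with Python's `concurrent.futures` (an illustration of the idea only; the actual HeteroCL build runner may work differently). Each outlined function gets its own synthesis job, so wall-clock time approaches that of the slowest stage. Threads suffice here because each job would mostly wait on an external tool process.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(func_name):
    # Placeholder for one Vivado HLS run per outlined function; a real
    # runner would launch the tool in its own project directory and
    # block until synthesis finishes.
    return f"{func_name}: done"

def build_modules(func_names):
    # Launch one synthesis job per outlined function in parallel;
    # Executor.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=len(func_names)) as pool:
        return list(pool.map(synthesize, func_names))

results = build_modules(["func_B", "func_C", "func_D"])
```

With three independent stage functions, this is the structure that gives the 3x best-case reduction mentioned above.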