cornell-zhang / hcl-dialect

HeteroCL-MLIR dialect for accelerator design
https://cornell-zhang.github.io/heterocl/index.html

[Op] Parameterized Customization Template #105

Closed chhzh123 closed 1 year ago

chhzh123 commented 2 years ago

Many applications are built from small kernels. Although these kernels may take inputs of different sizes, their computation patterns are often exactly the same. Consider a neural network with tens of convolutional kernels: it is tedious for users to manually add optimization primitives to each of them, even though they all use the same optimization method.

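To tackle this challenge and give users a better interface for declaring and reusing optimizations, we propose a parameterized customization template. Three new operations are introduced, all of which appear in the snippets below: hcl.customization, which defines a named template over tensors, a stage handle, and loop handles; hcl.end, which terminates the template body; and hcl.apply, which applies a template to the concrete tensors and handles of a kernel.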

The following code snippet gives an example of defining a sequence of optimizations for a GEMM kernel. The template takes two input matrices A and B and an output matrix C, along with the stage handle and the loop handles. The types and sizes of the tensors need not be given, since the optimization should work for any inputs as long as the kernel is a GEMM. We use ? to match a dimension of unknown size and !hcl.Type to match a generic element type. This provides a polymorphic way to define optimizations.

    hcl.customization @gemm_opt(
        %A: memref<?x?x!hcl.Type>,
        %B: memref<?x?x!hcl.Type>,
        %C: memref<?x?x!hcl.Type>,
        %s: !hcl.StageHandle,
        %i: !hcl.LoopHandle,
        %j: !hcl.LoopHandle,
        %k: !hcl.LoopHandle
    ) {
        hcl.pipeline(%s, %j, 1)
        hcl.partition(%A: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.partition(%B: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.partition(%C: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.end
    }
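To make the matching rule concrete, here is a minimal Python sketch of how a template type such as memref<?x?x!hcl.Type> could be checked against a concrete memref type. It is only an illustration of the semantics described above, not the dialect's implementation: the helper names parse_memref and matches are hypothetical, and the sketch assumes that ? matches any single dimension and that !hcl.Type matches any element type.

```python
import re

# Hypothetical helpers; they model the matching semantics only and are
# not part of the hcl dialect or its Python bindings.
_MEMREF = re.compile(r"memref<((?:(?:\d+|\?)x)*)(.+)>")


def parse_memref(type_str):
    """Split e.g. 'memref<64x32xi32>' into (['64', '32'], 'i32')."""
    match = _MEMREF.fullmatch(type_str)
    if match is None:
        raise ValueError(f"not a memref type: {type_str}")
    dims = match.group(1).split("x")[:-1]  # drop the empty tail after the last 'x'
    return dims, match.group(2)


def matches(pattern, concrete):
    """Return True if the concrete memref type fits the template pattern."""
    p_dims, p_elem = parse_memref(pattern)
    c_dims, c_elem = parse_memref(concrete)
    if len(p_dims) != len(c_dims):
        return False  # the rank must agree even when sizes are unknown
    dims_ok = all(p == "?" or p == c for p, c in zip(p_dims, c_dims))
    elem_ok = p_elem == "!hcl.Type" or p_elem == c_elem
    return dims_ok and elem_ok


template = "memref<?x?x!hcl.Type>"
fits_small = matches(template, "memref<64x32xi32>")
fits_large = matches(template, "memref<64x64xi32>")
fits_vector = matches(template, "memref<64xi32>")
```

Under these assumptions, any two-dimensional memref fits the template whatever its size or element type, while a one-dimensional memref is rejected because its rank differs.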

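Defining optimizations this way has clear advantages: a sequence of primitives is declared once, independently of any concrete tensor size or element type, and can then be reused by every kernel that matches the template.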

The algorithm part is shown below. It contains two GEMM kernels of different sizes, and the same customization is applied to both without writing extra code.

module {
    func @top(%A: memref<64x32xi32>, %B: memref<32x64xi32>, %C: memref<64x64xi32>) -> memref<64x64xi32>
    {
        %s1 = hcl.create_stage_handle "s1" : !hcl.StageHandle
        %i1 = hcl.create_loop_handle "i1" : !hcl.LoopHandle
        %j1 = hcl.create_loop_handle "j1" : !hcl.LoopHandle
        %k1 = hcl.create_loop_handle "k1" : !hcl.LoopHandle
        // D = A * B
        %D = memref.alloc() : memref<64x64xi32>
        affine.for %i = 0 to 64 {
            affine.for %j = 0 to 64 {
                affine.for %k = 0 to 32 {
                    %a = affine.load %A[%i, %k] : memref<64x32xi32>
                    %b = affine.load %B[%k, %j] : memref<32x64xi32>
                    %c = affine.load %D[%i, %j] : memref<64x64xi32>
                    %prod = arith.muli %a, %b : i32
                    %sum = arith.addi %prod, %c : i32
                    affine.store %sum, %D[%i, %j] : memref<64x64xi32>
                } { loop_name = "k1" }
            } { loop_name = "j1" }
        } { loop_name = "i1", stage_name = "s1" }
        %s2 = hcl.create_stage_handle "s2" : !hcl.StageHandle
        %i2 = hcl.create_loop_handle "i2" : !hcl.LoopHandle
        %j2 = hcl.create_loop_handle "j2" : !hcl.LoopHandle
        %k2 = hcl.create_loop_handle "k2" : !hcl.LoopHandle
        // E = C * D
        %E = memref.alloc() : memref<64x64xi32>
        affine.for %i = 0 to 64 {
            affine.for %j = 0 to 64 {
                affine.for %k = 0 to 64 {
                    %c = affine.load %C[%i, %k] : memref<64x64xi32>
                    %d = affine.load %D[%k, %j] : memref<64x64xi32>
                    %e = affine.load %E[%i, %j] : memref<64x64xi32>
                    %prod = arith.muli %c, %d : i32
                    %sum = arith.addi %prod, %e : i32
                    affine.store %sum, %E[%i, %j] : memref<64x64xi32>
                } { loop_name = "k2" }
            } { loop_name = "j2" }
        } { loop_name = "i2", stage_name = "s2" }
        // Reuse the same template for both GEMM kernels, regardless of their sizes
        hcl.apply @gemm_opt(%A, %B, %D, %s1, %i1, %j1, %k1) : (memref<64x32xi32>, memref<32x64xi32>, memref<64x64xi32>, !hcl.StageHandle, !hcl.LoopHandle, !hcl.LoopHandle, !hcl.LoopHandle) -> ()
        hcl.apply @gemm_opt(%C, %D, %E, %s2, %i2, %j2, %k2) : (memref<64x64xi32>, memref<64x64xi32>, memref<64x64xi32>, !hcl.StageHandle, !hcl.LoopHandle, !hcl.LoopHandle, !hcl.LoopHandle) -> ()
        return %E : memref<64x64xi32>
    }
}
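The reuse pattern above can be modeled conceptually in a few lines of Python: a template is a function over abstract handles, and each application records the same primitives against a different kernel. This is a sketch under stated assumptions rather than how the dialect lowers hcl.apply. The Schedule class and its methods are hypothetical, and tensors and handles are represented by plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Schedule:
    """Hypothetical recorder for the primitives a template emits."""
    primitives: list = field(default_factory=list)

    def pipeline(self, stage, loop, ii):
        self.primitives.append(("pipeline", stage, loop, ii))

    def partition(self, tensor, kind, dim):
        self.primitives.append(("partition", tensor, kind, dim))


def gemm_opt(sch, A, B, C, s, i, j, k):
    """Mirror of @gemm_opt: pipeline loop j, fully partition A, B, and C."""
    sch.pipeline(s, j, 1)
    for tensor in (A, B, C):
        sch.partition(tensor, "CompletePartition", 2)


sch = Schedule()
# Two kernels with different shapes share one template definition.
gemm_opt(sch, "A", "B", "D", "s1", "i1", "j1", "k1")
gemm_opt(sch, "C", "D", "E", "s2", "i2", "j2", "k2")
```

Each call contributes one pipeline primitive and three partition primitives, so the two applications record eight primitives in total without the template mentioning any concrete size.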

In the future, users could also add constraints to a customization to further specify which kinds of kernels it can be applied to. Since the customization is a template, partial specialization could also be supported later.

zzzDavid commented 1 year ago

Closing this issue, as the customization template has been implemented. Test cases have been added here: https://github.com/cornell-zhang/hcl-dialect-prototype/tree/main/test/Transforms/template