[Op] Parameterized Customization Template

Many applications are built from small kernels. Although these kernels may take inputs of different sizes, their computation patterns are often exactly the same. Consider a neural network with tens of convolutional kernels: manually adding optimization primitives to each of them is tedious, even though they all use the same optimization method.

To address this challenge and give users a better interface for declaring and reusing optimizations, we propose a parameterized customization template. Three new operations are introduced:

hcl.customization: A customization is a sequence of optimization primitives. It takes a stage handle, the loop handles of that stage, and the input/output tensors, and applies the primitives to them. Its interface is similar to that of builtin.func, with the optimization primitives written in the body.
hcl.apply: This operation applies a customization to an algorithm. It acts like a call operation.
hcl.end: The terminator of the region of hcl.customization.

The following code snippet defines a sequence of optimizations for a GEMM kernel. It takes two input matrices A and B and an output matrix C, along with the stage handle and the loop handles. The types and sizes of the tensors need not be given, since the optimization should work for any inputs as long as the kernel is a GEMM. We use ? to match a dimension of unknown size and !hcl.Type to match a generic element type. This provides a polymorphic way to define optimizations.

    hcl.customization @gemm_opt(
        %A: memref<?x?x!hcl.Type>,
        %B: memref<?x?x!hcl.Type>,
        %C: memref<?x?x!hcl.Type>,
        %s: !hcl.StageHandle,
        %i: !hcl.LoopHandle,
        %j: !hcl.LoopHandle,
        %k: !hcl.LoopHandle
    ) {
        hcl.pipeline(%s, %j, 1)
        hcl.partition(%A: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.partition(%B: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.partition(%C: memref<?x?x!hcl.Type>, "CompletePartition", 2)
        hcl.end
    }
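The matching rule described above (? stands for any dimension size, !hcl.Type for any element type) can be pictured with a small Python model. This is only a sketch under stated assumptions, not the dialect's implementation: parse_memref and matches are hypothetical helpers, types are handled as plain strings, and the rule that the rank must agree is an assumption.

```python
import re

# Hypothetical helpers for illustration only; they are not part of hcl-dialect.
_DIM = re.compile(r"(\?|\d+)x")  # one leading dimension such as "64x" or "?x"


def parse_memref(type_str):
    """Split "memref<64x32xi32>" into (["64", "32"], "i32")."""
    m = re.fullmatch(r"memref<(.+)>", type_str.strip())
    if m is None:
        raise ValueError(f"not a memref type: {type_str}")
    body = m.group(1)
    dims, pos = [], 0
    while True:
        d = _DIM.match(body, pos)
        if d is None:
            break
        dims.append(d.group(1))
        pos = d.end()
    return dims, body[pos:]  # whatever remains is the element type


def matches(template, concrete):
    """Return True if the concrete memref type fits the template type."""
    t_dims, t_elem = parse_memref(template)
    c_dims, c_elem = parse_memref(concrete)
    if len(t_dims) != len(c_dims):  # assumption: the rank must agree
        return False
    dims_ok = all(t == "?" or t == c for t, c in zip(t_dims, c_dims))
    elem_ok = t_elem == "!hcl.Type" or t_elem == c_elem
    return dims_ok and elem_ok


template = "memref<?x?x!hcl.Type>"
print(matches(template, "memref<64x32xi32>"))     # 2-D, any sizes, any type
print(matches(template, "memref<64x64x64xi32>"))  # rank differs
```

Under this model the template parameters of @gemm_opt accept any two-dimensional memref regardless of shape or element type, which is what lets a single customization serve GEMM kernels of different sizes.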

Defining optimizations this way has the following advantages:

Modularity: The customization is fully decoupled from the algorithm specification. The design is no longer monolithic, and users can easily tell which kernel a set of optimizations targets.
Portability: The customization can be stored in a separate file and can be easily imported into different application programs.
Composability: The customization is both human-readable and parseable by the MLIR compiler. Users can read in a customization, add new optimizations, and write the result back to a file. This also helps the compiler perform design space exploration (DSE).

The algorithm part is shown below. It contains two GEMM kernels of different sizes, and the same customization is applied to both without writing any extra optimization code.

module {
    func @top(%A: memref<64x32xi32>, %B: memref<32x64xi32>, %C: memref<64x64xi32>) -> memref<64x64xi32>
    {
        %s1 = hcl.create_stage_handle "s1" : !hcl.StageHandle
        %i1 = hcl.create_loop_handle "i1" : !hcl.LoopHandle
        %j1 = hcl.create_loop_handle "j1" : !hcl.LoopHandle
        %k1 = hcl.create_loop_handle "k1" : !hcl.LoopHandle
        // D = A * B
        %D = memref.alloc() : memref<64x64xi32>
        affine.for %i = 0 to 64 {
            affine.for %j = 0 to 64 {
                affine.for %k = 0 to 32 {
                    %a = affine.load %A[%i, %k] : memref<64x32xi32>
                    %b = affine.load %B[%k, %j] : memref<32x64xi32>
                    %c = affine.load %D[%i, %j] : memref<64x64xi32>
                    %prod = arith.muli %a, %b : i32
                    %sum = arith.addi %prod, %c : i32
                    affine.store %sum, %D[%i, %j] : memref<64x64xi32>
                } { loop_name = "k1" }
            } { loop_name = "j1" }
        } { loop_name = "i1", stage_name = "s1" }
        %s2 = hcl.create_stage_handle "s2" : !hcl.StageHandle
        %i2 = hcl.create_loop_handle "i2" : !hcl.LoopHandle
        %j2 = hcl.create_loop_handle "j2" : !hcl.LoopHandle
        %k2 = hcl.create_loop_handle "k2" : !hcl.LoopHandle
        // E = C * D
        %E = memref.alloc() : memref<64x64xi32>
        affine.for %i = 0 to 64 {
            affine.for %j = 0 to 64 {
                affine.for %k = 0 to 64 {
                    %c = affine.load %C[%i, %k] : memref<64x64xi32>
                    %d = affine.load %D[%k, %j] : memref<64x64xi32>
                    %e = affine.load %E[%i, %j] : memref<64x64xi32>
                    %prod = arith.muli %c, %d : i32
                    %sum = arith.addi %prod, %e : i32
                    affine.store %sum, %E[%i, %j] : memref<64x64xi32>
                } { loop_name = "k2" }
            } { loop_name = "j2" }
        } { loop_name = "i2", stage_name = "s2" }
        hcl.apply @gemm_opt(%A, %B, %D, %s1, %i1, %j1, %k1) : (memref<64x32xi32>, memref<32x64xi32>, memref<64x64xi32>, !hcl.StageHandle, !hcl.LoopHandle, !hcl.LoopHandle, !hcl.LoopHandle) -> ()
        hcl.apply @gemm_opt(%C, %D, %E, %s2, %i2, %j2, %k2) : (memref<64x64xi32>, memref<64x64xi32>, memref<64x64xi32>, !hcl.StageHandle, !hcl.LoopHandle, !hcl.LoopHandle, !hcl.LoopHandle) -> ()
        return %E : memref<64x64xi32>
    }
}
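Since hcl.apply acts like a call operation, one way to picture the two hcl.apply lines above is as inlining the body of @gemm_opt with the formal parameters replaced by the actual arguments. The Python model below is a sketch of that idea only: the Customization class and its apply method are hypothetical, and how the compiler actually lowers hcl.apply is not described here.

```python
class Customization:
    """Hypothetical model of hcl.customization: named parameters plus a body."""

    def __init__(self, name, params, body):
        self.name = name
        self.params = params  # formal parameter names, e.g. "%A", "%s"
        self.body = body      # list of (primitive, operands) tuples

    def apply(self, *args):
        """Model of hcl.apply: substitute the arguments and return the primitives."""
        if len(args) != len(self.params):
            raise TypeError(
                f"@{self.name} expects {len(self.params)} arguments, got {len(args)}"
            )
        binding = dict(zip(self.params, args))
        # Operands that are not parameters (constants, strings) pass through as-is.
        return [
            (op, tuple(binding.get(x, x) for x in operands))
            for op, operands in self.body
        ]


# Mirrors the body of @gemm_opt from the customization shown earlier.
gemm_opt = Customization(
    "gemm_opt",
    ["%A", "%B", "%C", "%s", "%i", "%j", "%k"],
    [
        ("hcl.pipeline", ("%s", "%j", 1)),
        ("hcl.partition", ("%A", "CompletePartition", 2)),
        ("hcl.partition", ("%B", "CompletePartition", 2)),
        ("hcl.partition", ("%C", "CompletePartition", 2)),
    ],
)

first = gemm_opt.apply("%A", "%B", "%D", "%s1", "%i1", "%j1", "%k1")
second = gemm_opt.apply("%C", "%D", "%E", "%s2", "%i2", "%j2", "%k2")
for op, operands in first + second:
    print(op, operands)
```

In this model the first application pipelines loop j1 of stage s1 and partitions A, B, and D, while the second pipelines j2 of stage s2 and partitions C, D, and E, all from a single definition.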

In the future, users could also attach constraints to a customization to specify which kinds of kernels may use it. Because the customization is a template, partial specialization could also be supported later.
