daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Additional/Reworked Codegen Passes #889

Closed AlexRTer closed 3 days ago

AlexRTer commented 3 weeks ago

This PR substantially reworks codegen for AllAgg* and EwOps and adds lowering for TransposeOp and Row/ColAgg*. All of these passes are part of the optional MLIR codegen pipeline, enabled with the --mlir-codegen flag, and offer an alternative lowering of these operations to MLIR instead of calls to precompiled C++ kernels. Currently, they only support DenseMatrix inputs whose dimensions are known at compile time, with any value type except booleans.
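
For example, a small script along these lines (an illustrative sketch, not part of the PR; the file name is made up) exercises both an elementwise op and a full aggregation, which these passes should pick up when run with the flag:

// ./bin/daphne --mlir-codegen ew_agg.daphne
X = [1.0, 2.0, 3.0, 4.0](2, 2);
Y = X + X;          // elementwise add (EwOps codegen)
print(sum(Y));      // full aggregation (AllAgg* codegen)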

Except for IdxMin and IdxMax, which are lowered directly to affine loops, and TransposeOp, which lowers to a named linalg op, all passes use linalg GenericOps that are lowered to affine loops by a later pass in the codegen pipeline. They convert the input DenseMatrix to a MemRef and allocate a new MemRef for the output, which is then converted back into a DenseMatrix.

Changes:

Ops with new codegen:

A small example of a lowered kernel:

// ./bin/daphne --mlir-codegen *.daphne
X = [1, 2, 3, 4, 5, 6](2, 3);
print(sum(X, 0));               // sumRow

The input is converted to a MemRef and a result MemRef is allocated. The first linalg GenericOp initializes the result MemRef by copying the first column of the input, and the second GenericOp iterates over the remaining columns and applies the aggregation operation, an addition in this case.

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, 0)>
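// #map is the identity indexing map; #map1 maps every (row, col) iteration index to
// column 0 of the 2x1 result, realizing the row-wise reduction.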
...
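    // Convert the input DenseMatrix to a MemRef, allocate the 2x1 result MemRef, and
    // expose it back to DAPHNE as a DenseMatrix via its aligned pointer.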
    %7 = "daphne.convertDenseMatrixToMemRef"(%6) : (!daphne.Matrix<2x3xsi64>) -> memref<2x3xsi64>
    %alloc = memref.alloc() : memref<2x1xsi64>
    %intptr = memref.extract_aligned_pointer_as_index %alloc : memref<2x1xsi64> -> index
    %8 = "daphne.convertMemRefToDenseMatrix"(%intptr, %c0, %c2, %c1, %c1, %c1) : (index, index, index, index, index, index) -> !daphne.Matrix<2x1xsi64>

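    // First generic: copy the first column of the input into the result MemRef.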
    %subview = memref.subview %7[0, 0] [2, 1] [1, 1] : memref<2x3xsi64> to memref<2x1xsi64, strided<[3, 1]>>
    linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%subview : memref<2x1xsi64, strided<[3, 1]>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      linalg.yield %in : si64
    }

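    // Second generic: fold the remaining columns into the result, one addition per element.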
    %subview_0 = memref.subview %7[0, 1] [2, 2] [1, 1] : memref<2x3xsi64> to memref<2x2xsi64, strided<[3, 1], offset: 1>>
    linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%subview_0 : memref<2x2xsi64, strided<[3, 1], offset: 1>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      %9 = linalg.index 0 : index
      %10 = memref.load %alloc[%9, %c0] : memref<2x1xsi64>
      %11 = builtin.unrealized_conversion_cast %in : si64 to i64
      %12 = builtin.unrealized_conversion_cast %10 : si64 to i64
      %13 = arith.addi %12, %11 : i64
      %14 = builtin.unrealized_conversion_cast %13 : i64 to si64
      linalg.yield %14 : si64
    }
...
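
After the later linalg-to-affine pass mentioned above, the second GenericOp becomes roughly the following loop nest (a hand-written sketch that reuses %7 and %alloc from the IR above and indexes the original input instead of the subview; not verbatim compiler output):

affine.for %i = 0 to 2 {
  affine.for %j = 1 to 3 {
    // load the current input element and the running per-row aggregate
    %v = affine.load %7[%i, %j] : memref<2x3xsi64>
    %acc = affine.load %alloc[%i, 0] : memref<2x1xsi64>
    // signed/signless casts as in the GenericOp body above
    %v_i = builtin.unrealized_conversion_cast %v : si64 to i64
    %acc_i = builtin.unrealized_conversion_cast %acc : si64 to i64
    %sum_i = arith.addi %acc_i, %v_i : i64
    %sum = builtin.unrealized_conversion_cast %sum_i : i64 to si64
    affine.store %sum, %alloc[%i, 0] : memref<2x1xsi64>
  }
}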

Known Limitations: