daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Additional/Reworked Codegen Passes #889

Closed AlexRTer closed 3 days ago

AlexRTer commented 3 weeks ago

This PR substantially reworks codegen for AllAgg* and EwOps and adds lowering for TransposeOp and Row/ColAgg*. All of these passes are part of the optional MLIR codegen pipeline, enabled with the --mlir-codegen flag, and offer an alternative lowering of these operations to MLIR instead of calls to precompiled C++ kernels. Currently, they only support DenseMatrix inputs whose dimensions are known at compile time, with any value type except booleans.
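
For example, a small script along these lines (an illustrative sketch, not part of the PR; the file name is made up) exercises both an elementwise op and a full aggregation, which these passes should pick up when run with the flag:

// ./bin/daphne --mlir-codegen ew_agg.daphne
X = [1.0, 2.0, 3.0, 4.0](2, 2);
Y = X + X;          // elementwise add (EwOps codegen)
print(sum(Y));      // full aggregation (AllAgg* codegen)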

Except for IdxMin and IdxMax, which are lowered directly to affine loops, and TransposeOp, which lowers to a named linalg op, all passes use linalg GenericOps that are lowered to affine loops by a later pass in the codegen pipeline. They convert the input DenseMatrix to a MemRef and allocate a new MemRef for the output, which is then converted back into a DenseMatrix.

Changes:

Ops with new codegen:

A small example of a lowered kernel:

// ./bin/daphne --mlir-codegen *.daphne
X = [1, 2, 3, 4, 5, 6](2, 3);
print(sum(X, 0));               // sumRow

The input is converted to a MemRef and a result MemRef is allocated. The first linalg GenericOp initializes the result MemRef by copying the first column of the input, and the second GenericOp iterates over the remaining columns and applies the aggregation operation, an addition in this case.

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, 0)>
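// #map is the identity indexing map; #map1 maps every (row, col) iteration index to
// column 0 of the 2x1 result, realizing the row-wise reduction.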
...
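    // Convert the input DenseMatrix to a MemRef, allocate the 2x1 result MemRef, and
    // expose it back to DAPHNE as a DenseMatrix via its aligned pointer.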
    %7 = "daphne.convertDenseMatrixToMemRef"(%6) : (!daphne.Matrix<2x3xsi64>) -> memref<2x3xsi64>
    %alloc = memref.alloc() : memref<2x1xsi64>
    %intptr = memref.extract_aligned_pointer_as_index %alloc : memref<2x1xsi64> -> index
    %8 = "daphne.convertMemRefToDenseMatrix"(%intptr, %c0, %c2, %c1, %c1, %c1) : (index, index, index, index, index, index) -> !daphne.Matrix<2x1xsi64>

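    // First generic: copy the first column of the input into the result MemRef.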
    %subview = memref.subview %7[0, 0] [2, 1] [1, 1] : memref<2x3xsi64> to memref<2x1xsi64, strided<[3, 1]>>
    linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%subview : memref<2x1xsi64, strided<[3, 1]>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      linalg.yield %in : si64
    }

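    // Second generic: fold the remaining columns into the result, one addition per element.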
    %subview_0 = memref.subview %7[0, 1] [2, 2] [1, 1] : memref<2x3xsi64> to memref<2x2xsi64, strided<[3, 1], offset: 1>>
    linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%subview_0 : memref<2x2xsi64, strided<[3, 1], offset: 1>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      %9 = linalg.index 0 : index
      %10 = memref.load %alloc[%9, %c0] : memref<2x1xsi64>
      %11 = builtin.unrealized_conversion_cast %in : si64 to i64
      %12 = builtin.unrealized_conversion_cast %10 : si64 to i64
      %13 = arith.addi %12, %11 : i64
      %14 = builtin.unrealized_conversion_cast %13 : i64 to si64
      linalg.yield %14 : si64
    }
...
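
After the later linalg-to-affine pass mentioned above, the second GenericOp becomes roughly the following loop nest (a hand-written sketch that reuses %7 and %alloc from the IR above and indexes the original input instead of the subview; not verbatim compiler output):

affine.for %i = 0 to 2 {
  affine.for %j = 1 to 3 {
    // load the current input element and the running per-row aggregate
    %v = affine.load %7[%i, %j] : memref<2x3xsi64>
    %acc = affine.load %alloc[%i, 0] : memref<2x1xsi64>
    // signed/signless casts as in the GenericOp body above
    %v_i = builtin.unrealized_conversion_cast %v : si64 to i64
    %acc_i = builtin.unrealized_conversion_cast %acc : si64 to i64
    %sum_i = arith.addi %acc_i, %v_i : i64
    %sum = builtin.unrealized_conversion_cast %sum_i : i64 to si64
    affine.store %sum, %alloc[%i, 0] : memref<2x1xsi64>
  }
}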

Known Limitations: