daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
65 stars 58 forks source link

Kernels for DenseMatrix of strings #415

Open akroviakov opened 2 years ago

akroviakov commented 2 years ago

With the introduction of DenseMatrix for strings, we need to support kernels for strings, as well as add new string-specific kernels. Specializing templates for strings may result in possibly severe code duplication. To address this issue, a conditional compilation (of a string-specific branch) within templates is preferred.

One of the main drawbacks in this regard of the current "naive" implementation is the usage of const char* to store strings, which requires heap allocation (and subsequent deletion) of temporaries before they can be set() in the result DenseMatrix.

Objectives:

  1. Introduce elementwise string-specific kernels: 1.1. Unary: ewLower, ewUpper 1.2. Binary: ewConcat, ewLike

  2. Extend existing kernels (not all of them, just try to find a good way to handle different classes of kernels) to support strings: 2.1. Output of new strings: fill, matrixConstant 2.2. Only output of existing strings: transpose 2.3. No output of strings: elementwise binary comparisons (lexicographical order) 2.4. Output of single string: aggregations

akroviakov commented 2 years ago

@pdamme When performing branching in AggRow.h, the warning about DCTX(ctx) argument is being displayed for the instantiation of AggRow<DenseMatrix<int>, DenseMatrix<const char*>> (i.e., IDXMAX/IDXMIN ops that are implemented in-place and the code for should not compile past these ops), is it allowed to add [[maybe_unused]] to that argument in apply()?

The reason the code should not compile past these ops is that when the compiler tries to compile the rest, the line with *valuesRes = agg; will fail, as it tries to perform the conversion from const char* to int, so we have to cut this branch and with that, the usage of ctx.

pdamme commented 2 years ago

I assume you mean you get a warning saying that ctx is unused in AggRow<...>::apply()? That surprises me a little, because we have many kernels that do not use ctx and don't cause such a warning. If it helps, feel free to add [[maybe_unused]].

Yes, I see that the code for some aggregation functions cannot be compiled for const char * as the input value type. I would be curious to know how you "cut" the branch.

akroviakov commented 2 years ago

For ewLike it makes sense to use std::regex. However then, in EwBinarySca.h, when using MAKE_EW_BINARY_SCA or a separate template specialization, where lhs is a string and rhs is a pattern, we have to face a choice:

The first case means a serious performance hit. The second one is more about the construction place of that regex and the dependency on std::regex for users of EwBinarySca.h. Moreover, since the main user of EwBinarySca is EwBinaryMat and the general Like operation expects inputs <DenseMatrix, columnIdx, pattern>, we'd need a new specialization of EwBinaryMat just for Like operation.

I already implemented a primitive form of Like as a simple kernel, should I pursue EwBinaryMat specialization or could a simple kernel be used for now?

pdamme commented 2 years ago

Very good observation regarding Like and regex. First of all, as commented on PR #418, Like should really be like any other elementwise binary comparison (no special kernel with a custom interface). In particular, we need at least:

All three variants could be used from SQL in DaphneDSL as well.

akroviakov commented 2 years ago

@pdamme I'm trying to implement like in DSL just like other ewOps and stumbled upon the following: I can use like in DSL for

x = "eHelloe" like "e%e";
print(x ? true : false);

but not for

x = "eHelloe" like "e%e";
print(x);

the error is /daphne/thirdparty/llvm-project/mlir/lib/IR/BuiltinAttributes.cpp:248: int64_t mlir::IntegerAttr::getInt() const: Assertion `(getType().isIndex() || getType().isSignlessInteger()) && "must be signless integer"' failed. Aborted (core dumped) This error can be fixed by adding utils.castBoolIf() in the visitor, though I'm not sure why is it needed, if folding returns bool anyways:

// DaphneDialect.cpp
mlir::OpFoldResult mlir::daphne::EwLikeOp::fold(ArrayRef<Attribute> operands) {
    ...
    if(operands[0].isa<StringAttr>() && operands[1].isa<StringAttr>()) {
    ...
        return mlir::BoolAttr::get(lhs.getContext(), std::regex_match(lhs.getValue().str(), toMatch));
    }

Then I tried to do x = [eHelloe] like "e%e";, but got:

JIT session error: Symbols not found: [ _cast__bool__DenseMatrix_char, _ewLike__DenseMatrix_char__DenseMatrix_char__char ]

Regardless of the cast, ewLike tries to return DenseMatrix of strings, and it is obvious, since Daphne_EwBinaryOp in DaphneOps.td does not differentiate for in/out types. So I tried to change it:

class Daphne_EwBinaryOp<string name, Type scalarTypeIN, Type scalarTypeOUT, list<OpTrait> traits = []>
   ...
    let arguments = (ins AnyTypeOf<[MatrixOf<[scalarTypeIN]>, scalarTypeIN]>:$lhs, AnyTypeOf<[MatrixOf<[scalarTypeIN]>, scalarTypeIN]>:$rhs);
    let results = (outs AnyTypeOf<[MatrixOf<[scalarTypeOUT]>, scalarTypeOUT]>:$res);
    let builders = [
        OpBuilder<(ins "Value":$lhs, "Value":$rhs), [{
            // TODO This is wrong if lhs is a scalar and rhs is a matrix.
            $_state.addTypes(lhs.getType());
            $_state.addOperands({lhs, rhs});
        }]>
    ];

and define ewLike as

def Daphne_EwLikeOp : Daphne_EwBinaryOp<"ewLike", StrScalar, IntScalar>;

Now I get

 error: 'daphne.ewLike' op result #0 must be matrix of integer or placeholder for an unknown type values or integer, but got '!daphne.Matrix<?x?x!daphne.String>'
module  {
  func @main() {
    %0 = "daphne.constant"() {value = 94092562866896 : ui64} : () -> ui64
    %1 = "daphne.constant"() {value = "e%e"} : () -> !daphne.String
    %2 = "daphne.constant"() {value = true} : () -> i1
    %3 = "daphne.constant"() {value = false} : () -> i1
    %4 = "daphne.matrixConstant"(%0) : (ui64) -> !daphne.Matrix<?x?x!daphne.String>
    %5 = "daphne.ewLike"(%4, %1) : (!daphne.Matrix<?x?x!daphne.String>, !daphne.String) -> !daphne.Matrix<?x?x!daphne.String>
    %6 = "daphne.cast"(%5) : (!daphne.Matrix<?x?x!daphne.String>) -> i1
    "daphne.print"(%6, %2, %3) : (i1, i1, i1) -> ()
    "daphne.return"() : () -> ()
  }
}

So ewLike still tries to return DenseMatrix of strings.

We can notice the line $_state.addTypes(lhs.getType()); in Daphne_EwBinaryOp, where we infer the result type, and that is why we see but got '!daphne.Matrix<?x?x!daphne.String>'.

One could leave the original Daphne_EwBinaryOp untouched (i.e., no explicit return type), copy its body to Daphne_EwLikeOp and change it, so that it operates on fixed types. But it seems that I cannot easily change the argument of $_state.addTypes(lhs.getType()); to anything, other than what I have (lhs/rhs).

What would be your advise on how to explicitly specify the return type in this situation?

akroviakov commented 2 years ago

Alright, the first issue above was due to the EwLikeOp::fold() returning BoolAttr, instead of IntegerAttr.

I couldn't find a way to solve the second issue in a general Daphne_EwBinaryOp manner, so for now it is similar to Daphne_EwBinaryOp, but without type inference (there's nothing to infer). I shifted the res type specification to DaphneDSLVisitor.cpp, so now the call looks like

    if(op == "like"){
        if(lhs.getType().isa<mlir::daphne::MatrixType>())
            return static_cast<mlir::Value>(builder.create<mlir::daphne::EwLikeOp>(loc, utils.matrixOf(resType), lhs, rhs));
        else
            return static_cast<mlir::Value>(builder.create<mlir::daphne::EwLikeOp>(loc, resType, lhs, rhs));
    }

Would be interesting to know if I've missed some way to implement it in a (more) general manner.

pdamme commented 2 years ago

As you have already found out, the problem is connected to type inferencem, but also to type constraints. It should be fixed in a pragmatic way for this PR, since I'm refactoring type inference in another branch anyway. For some background: the definition of a op in DaphneOps.td, e.g., def Daphne_EwLikeOp : Daphne_EwBinaryOp<"ewLike", StrScalar, IntScalar>; merely defines constraints on the input/output types (besides other things), but doesn't do the inference. The constraints are automatically checked at various points in time, and any type inference must respect these constraints.

Currently, EwBinaryOp expects the same value type scalarType for its inputs and output (DaphneOps.td line 214f). That needs to be generalized to support different value types. Once done, it should be possible to define EwLikeOp as a EwBinaryOp (or even better EwCmpOp), and we could save some of the extra interface implementations for EwLikeOp.

Furhermore, I recommend adapting the type inference in the custom builder in DaphneOps.td line 217ff (on your branch)

    let builders = [
        OpBuilder<(ins "Value":$lhs, "Value":$rhs), [{
            // TODO This is wrong if lhs is a scalar and rhs is a matrix.
            $_state.addTypes(lhs.getType());
            $_state.addOperands({lhs, rhs});
        }]>
    ];

Inside the code, you can access an mlir::Builder by $_builder. For instance, you could say $_state.addTypes($_builder.getIntegerType(64, true));, if the inputs are (matrices of) strings (see https://mlir.llvm.org/docs/OpDefinitions/#custom-builder-methods, https://mlir.llvm.org/doxygen/classmlir_1_1Builder.html).

Hope that helps.

Side note regarding constant folding: Constant folding happens at certain stages in the compilation chain, but we already have IR before that, and it must not violate constraints, e.g., on the input/output types. Generally, it should also work without folding, because there can be cases where the inputs are not compile-time constants.