Open asfimport opened 2 years ago
Kouhei Sutou / @kou:
Could you post this proposal to dev@arrow.apache.org
to get more feedback?
Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
Background
My team uses an expression computation library for our C++ feature engineering pipeline. We currently use Exprtk. We recently tried out Gandiva and wrote some benchmarks. We discovered that Gandiva is several times faster than Exprtk in our use cases. Therefore we decided to switch to Gandiva for computing expressions.
Objective
As of current, due to its lack of a frontend, we need to manually construct an AST to use Gandiva. This is inconvenient and requires extra learning costs. We also want to enable our ML engineers to dynamically create/update an expression with runtime hot-loaded configs without restarting our server. This is currently impossible with Gandiva because the Expression tree is statically created with C++ and must be compiled in the server binary.
Therefore, we would like to implement a parser frontend for Gandiva, so that Gandiva becomes a standalone complete expression compiler and evaluator, and a drop-in replacement for the existing libraries like Exprtk and TinyExpr. The goal is to enable the following functionality:
``
The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the proposed usage of the parser. It offers two benefits:
It allows dynamically adding new expressions or and changing existing ones with a runtime hot-loaded config file without restarting our server.
Syntax
The goal is to design a succinct and intuitive grammar for both schema and Gandiva expressions. We will need a corresponding grammar for each Node type in Gandiva.
{}i32{
},{}u64{
},f32
to denote a literal node’s type. The types of unsuffixed literals are inferred by their usage. Otherwise, the default type for integers isint32
andfloat32
for floating points. String and binary literals are wrapped with single or double quotes. Decimal128 literals will not be supported in the first version.Fields: Just their names as defined in the schema. To avoid conflicts with other node types, field names must start with alphabetical letters.
Functions:
{}<function_name>(<param1>, <param2>, ...){
}. For functions with multiple overloads, their return type is inferred from input types. For commonly used functions, we would also like to support infix forms. They include:Ifs: We would like to support two grammars for if expressions:
if(<cond>, <then>, <else>)
for its simplicity and functional feel;if(<cond>) \{ <then> } else \{ <else>
} since it’s the C++if
grammar and has better formatting for complex expressions.Booleans: We would like to support both
&& ||
andand or
keywords the same as C++.InExpressions:
<eval> in (<member1>, <member2>, ...)
. Its type is also inferred.The grammar can be roughly represented as:
``
lower cases are non-terminals and upper cases are tokens.
Implementation
Lexing and Parsing
We would like to use flex and bison for lexing and parsing. They have several advantages compared to other options like Antlr4 and Boost::Spirit.
Flex&bison takes a string and outputs a
gandiva::node
tree. The tree may contain untyped nodes, e.g., unsuffixed literals and functions.Type Inference
We’ll have a TypeInferenceVisitor class that inherits node visitor, implementing a 2-pass DFS algorithm to do type inference. In each pass, the visitor tries to infer current node’s and its children’s types from currently available info:
If it’s a non-leaf node such as
function
,{}if{
} ,{}boolean{
}:SignaturePattern
based on currently known types. The pattern includes the current node’s type and children’s types. The types can benullptr
meaning the type is currently unknown. For example, theSignaturePattern
forfunc(x: u64, 5: untyped)
will be(u64, nullptr)—>nullptr.
{}function{
}s, it’s the signatures registered at the function registry. For{}if{
}, it’s(bool, <type>, <type>)—><type>
for any type etc.SignaturePattern
If more than one matches, we extract a common pattern from the set of matched signatures. For example,
(nullptr, nullptr)—>bool
can be extracted from(double, double)—>bool
and(int32, int32)—>bool
. We then update types based on this common pattern.If it’s a leaf node, just update its value if its type is set by its parent. We need to do this because untyped literals’ values are saved in a generic container.
We run this procedure 2 times, with the second pass a bit different. In the second pass, if a literal node is still untyped, give it a default type (
{}int32{
} for ints andfloat32
for floats).bottom up of default literal types. A literal is given a default type.
In the second pass, since all leaf nodes’ types are known. The types of all nodes are guaranteed to be inferred.
Proof of correctness
I have a proof of correctness of this inference procedure in mind, but it’s too long to be written down here. (Fermat did it too :)) But the correctness is based on these two facts:
func := (int32, int32)-->int32
andfunc := (int32, int32)-->double
in the Gandiva registry. Therefore we can always know a function’s type if its children’s types are known.func := (int16, int16)-->int16
and{}func := (int64, int64)-->int64{
}, but no{}func := (int32, int32)-->int32{
}, then the inference procedure will fail on expressionfunc(1, 2)
because it cannot tell the types of1
and{}2{
}, and giving them default types won't work.Prototype and Examples
I’ve already written a mostly working prototype of the parser and type inference visitor and unit tests for them. They are on the branch https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference.
You can checkout the diffs here: https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference
The main files are:
cmake .. --preset=ninja-debug-gandiva
and{}ninja test-gandiva-tests{
}.Any suggestion/question is appreciated!
Reporter: Jin Shang / @js8544
Subtasks:
PRs and other links:
Note: This issue was originally created as ARROW-17668. Please see the migration documentation for further details.