[C++][Gandiva] Add parser frontend for Gandiva

asfimport commented 2 years ago

Background

My team uses an expression computation library for our C++ feature engineering pipeline. We currently use Exprtk. We recently tried out Gandiva and wrote some benchmarks. We discovered that Gandiva is several times faster than Exprtk in our use cases. Therefore we decided to switch to Gandiva for computing expressions.

Objective

As of current, due to its lack of a frontend, we need to manually construct an AST to use Gandiva. This is inconvenient and requires extra learning costs. We also want to enable our ML engineers to dynamically create/update an expression with runtime hot-loaded configs without restarting our server. This is currently impossible with Gandiva because the Expression tree is statically created with C++ and must be compiled in the server binary.

Therefore, we would like to implement a parser frontend for Gandiva, so that Gandiva becomes a standalone complete expression compiler and evaluator, and a drop-in replacement for the existing libraries like Exprtk and TinyExpr. The goal is to enable the following functionality:

``


// Create schema for gandiva
auto field_x = arrow::field("x", arrow::uint64());
auto field_y = arrow::field("y", arrow::float64());
auto field_z = arrow::field("z", arrow::boolean());
auto schema = arrow::schema({field_x, field_y, field_z});

/** BEGIN CHANGE **/
// Use the Parser to generate a NodePtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto status = parser.Parse(expr_str, &root_node);
/** END CHANGE **/

// The rest is normal usage of Gandiva projector
auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
std::shared_ptr<gandiva::Projector> projector;
auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);

auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto status = projector->Evaluate(*in_batch, pool, &outputs);

The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the proposed usage of the parser. It offers two benefits:

It’s more intuitive to write math expressions compared to constructing trees, thus easier to use.
It allows dynamically adding new expressions or and changing existing ones with a runtime hot-loaded config file without restarting our server.

Syntax

The goal is to design a succinct and intuitive grammar for both schema and Gandiva expressions. We will need a corresponding grammar for each Node type in Gandiva.
- Literals: We find Rust’s literal representation(https://doc.rust-lang.org/rust-by-example/types/literals.html) very intuitive. We’ll support suffixes such as {}i32{}, {}u64{}, f32 to denote a literal node’s type. The types of unsuffixed literals are inferred by their usage. Otherwise, the default type for integers is int32 and float32 for floating points. String and binary literals are wrapped with single or double quotes. Decimal128 literals will not be supported in the first version.

Fields: Just their names as defined in the schema. To avoid conflicts with other node types, field names must start with alphabetical letters.
Functions: {}<function_name>(<param1>, <param2>, ...){}. For functions with multiple overloads, their return type is inferred from input types. For commonly used functions, we would also like to support infix forms. They include:
- Comparisons: equal(==), not equal(!=), greater than(>), greater than or equal to(>=), less than(<), less than or equal to(<=)
- Arithmetics: add(+), subtract(-), multiply(*), divide(/), modulo(%), power(^{), bitwise and(&), bitwise or(|), bitwise xor(}), bitwise not(~) Function aliases with spaces in their names won’t be supported such as “is not false” are not supported.
Ifs: We would like to support two grammars for if expressions:
- if(<cond>, <then>, <else>) for its simplicity and functional feel;
- if(<cond>) \{ <then> } else \{ <else>} since it’s the C++ if grammar and has better formatting for complex expressions.
Booleans: We would like to support both && || and and or keywords the same as C++.
InExpressions: <eval> in (<member1>, <member2>, ...) . Its type is also inferred.

The grammar can be roughly represented as:

``
```
// Create schema for gandiva
auto field_x = arrow::field("x", arrow::uint64());
auto field_y = arrow::field("y", arrow::float64());
auto field_z = arrow::field("z", arrow::boolean());
auto schema = arrow::schema({field_x, field_y, field_z});

/** BEGIN CHANGE **/
// Use the Parser to generate a NodePtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto status = parser.Parse(expr_str, &root_node);
/** END CHANGE **/

// The rest is normal usage of Gandiva projector
auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
std::shared_ptr<gandiva::Projector> projector;
auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);

auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto status = projector->Evaluate(*in_batch, pool, &outputs);
```
lower cases are non-terminals and upper cases are tokens.

Implementation

Lexing and Parsing

We would like to use flex and bison for lexing and parsing. They have several advantages compared to other options like Antlr4 and Boost::Spirit.
1. They are the most classical and popular parsing library in the cpp world.
2. They allow us to precompile the parser codes and have no runtime dependencies.
Flex&bison takes a string and outputs a gandiva::node tree. The tree may contain untyped nodes, e.g., unsuffixed literals and functions.

Type Inference

We’ll have a TypeInferenceVisitor class that inherits node visitor, implementing a 2-pass DFS algorithm to do type inference. In each pass, the visitor tries to infer current node’s and its children’s types from currently available info:

If it’s a non-leaf node such as function ,{}if{} ,{}boolean{}:
1. First visit each child: let them infer their own types as much as they can.
2. Create a SignaturePattern based on currently known types. The pattern includes the current node’s type and children’s types. The types can be nullptr meaning the type is currently unknown. For example, the SignaturePattern for func(x: u64, 5: untyped) will be (u64, nullptr)—>nullptr.
3. Get all available signatures of the current node. For {}function{}s, it’s the signatures registered at the function registry. For {}if{}, it’s (bool, <type>, <type>)—><type> for any type etc.
4. Try to match each signature with the current pattern SignaturePattern
5. If no one matches, it’s a type error.
6. If only one matches, it’s the matched signature. We then update current node’s and their children’s types based on this signature.
7. If more than one matches, we extract a common pattern from the set of matched signatures. For example, (nullptr, nullptr)—>bool can be extracted from (double, double)—>bool and (int32, int32)—>bool . We then update types based on this common pattern.
  
  If it’s a leaf node, just update its value if its type is set by its parent. We need to do this because untyped literals’ values are saved in a generic container.
  
  We run this procedure 2 times, with the second pass a bit different. In the second pass, if a literal node is still untyped, give it a default type ({}int32{} for ints and float32 for floats).
8. The first pass is bottom up propagation of types. A parent’s type is inferred from its children.
9. The second pass is both:
10. top down propagation of first pass’ info. Once a parent’s type is known in the first pass, it may set it’s children’s type
11. bottom up of default literal types. A literal is given a default type.
  
  In the second pass, since all leaf nodes’ types are known. The types of all nodes are guaranteed to be inferred.
  
  Proof of correctness
  
  I have a proof of correctness of this inference procedure in mind, but it’s too long to be written down here. (Fermat did it too :)) But the correctness is based on these two facts:
12. There are no overloads that differ only on their return types. E.g. there are no func := (int32, int32)-->int32 and func := (int32, int32)-->double in the Gandiva registry. Therefore we can always know a function’s type if its children’s types are known.
13. There are no overloads that accept only non-default numeric types. For example, if there is a func := (int16, int16)-->int16 and {}func := (int64, int64)-->int64{}, but no {}func := (int32, int32)-->int32{}, then the inference procedure will fail on expression func(1, 2) because it cannot tell the types of 1 and {}2{}, and giving them default types won't work.
Prototype and Examples

I’ve already written a mostly working prototype of the parser and type inference visitor and unit tests for them. They are on the branch https://github.com/js8544/arrow/tree/jinshang/gandiva/type_inference.

You can checkout the diffs here: https://github.com/apache/arrow/compare/master...js8544:arrow:jinshang/gandiva/type_inference

The main files are:
1. cpp/src/gandiva/grammar.yy: grammar rules for Bison.
2. cpp/src/gandiva/lex.ll: lex rules for Flex.
3. cpp/src/gandiva/typeinference.h/cc: type inference procedure.
4. cpp/src/gandiva/parser.cc: the driver class that combines the three components.
5. cpp/src/gandiva/parser_test.cc: unit tests containing examples of the proposed syntax and the result expression trees the parser generates. You can run the tests by running cmake .. --preset=ninja-debug-gandiva and {}ninja test-gandiva-tests{}.
Any suggestion/question is appreciated!

Reporter: Jin Shang / @js8544

Subtasks:

_{Note: This issue was originally created as ARROW-17668. Please see the migration documentation for further details.}

asfimport commented 2 years ago

Kouhei Sutou / @kou: Could you post this proposal to dev@arrow.apache.org to get more feedback?

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

apache / arrow