apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Gandiva] Add parser frontend for Gandiva #32908

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Background

My team uses an expression computation library for our C++ feature engineering pipeline. We currently use Exprtk. We recently tried out Gandiva and wrote some benchmarks. We discovered that Gandiva is several times faster than Exprtk in our use cases. Therefore we decided to switch to Gandiva for computing expressions.

Objective

Currently, because Gandiva lacks a frontend, we must construct an AST manually to use it. This is inconvenient and adds a learning cost. We also want our ML engineers to be able to create or update expressions dynamically from runtime hot-loaded configs, without restarting our server. This is impossible with Gandiva today because the expression tree is built statically in C++ and must be compiled into the server binary.

Therefore, we would like to implement a parser frontend for Gandiva, so that Gandiva becomes a standalone complete expression compiler and evaluator, and a drop-in replacement for the existing libraries like Exprtk and TinyExpr. The goal is to enable the following functionality:

 

```cpp
// Create schema for gandiva
auto field_x = arrow::field("x", arrow::uint64());
auto field_y = arrow::field("y", arrow::float64());
auto field_z = arrow::field("z", arrow::boolean());
auto schema = arrow::schema({field_x, field_y, field_z});
// Output field for the expression; the name "result" is illustrative
auto result_field = arrow::field("result", arrow::float64());

/** BEGIN CHANGE **/
// Use the Parser to generate a NodePtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser(schema);
gandiva::NodePtr root_node;
auto status = parser.Parse(expr_str, &root_node);
/** END CHANGE **/

// The rest is normal usage of Gandiva projector
auto expr_ptr = std::make_shared<gandiva::Expression>(root_node, result_field);
std::shared_ptr<gandiva::Projector> projector;
status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);

// `size` (row count) and `input_arr` (input arrays) are assumed to be provided by the caller
auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
status = projector->Evaluate(*in_batch, pool, &outputs);
```

 

The code block enclosed by “BEGIN CHANGE” and “END CHANGE” is the proposed usage of the parser. It offers two benefits:

  1. It’s more intuitive to write math expressions compared to constructing trees, thus easier to use.
  2. It allows dynamically adding new expressions or changing existing ones with a runtime hot-loaded config file, without restarting our server.
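To make the idea of a parser frontend concrete, here is a minimal, self-contained sketch of a recursive-descent parser that turns an expression string into a tree. It is not the proposed `gandiva::Parser`: the `Ast`, `ToyParser`, and `ToString` names are hypothetical, the grammar (function calls, parentheses, and the four arithmetic operators) is an assumption, and a real frontend would build `gandiva::NodePtr` values and resolve field types against the schema instead of storing strings.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node; a real frontend would produce gandiva::NodePtr instead.
struct Ast {
  std::string label;                       // field name, literal text, operator, or function name
  std::vector<std::unique_ptr<Ast>> args;  // empty for fields and literals
};
using AstPtr = std::unique_ptr<Ast>;

class ToyParser {
 public:
  explicit ToyParser(std::string text) : text_(std::move(text)) {}

  AstPtr Parse() {
    AstPtr root = ParseSum();
    SkipSpaces();
    if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
    return root;
  }

 private:
  // sum := product (('+' | '-') product)*
  AstPtr ParseSum() { return ParseBinary("+-", &ToyParser::ParseProduct); }
  // product := primary (('*' | '/') primary)*
  AstPtr ParseProduct() { return ParseBinary("*/", &ToyParser::ParsePrimary); }

  // Left-associative chain of binary operators drawn from `ops`.
  AstPtr ParseBinary(const std::string& ops, AstPtr (ToyParser::*next)()) {
    AstPtr lhs = (this->*next)();
    SkipSpaces();
    while (pos_ < text_.size() && ops.find(text_[pos_]) != std::string::npos) {
      auto node = std::make_unique<Ast>();
      node->label = std::string(1, text_[pos_++]);
      node->args.push_back(std::move(lhs));
      node->args.push_back((this->*next)());
      lhs = std::move(node);
      SkipSpaces();
    }
    return lhs;
  }

  // primary := token | token '(' sum (',' sum)* ')' | '(' sum ')'
  // A token is a run of letters, digits, '_' and '.', so it covers both
  // field names and simple numeric literals (a deliberate simplification).
  AstPtr ParsePrimary() {
    SkipSpaces();
    if (pos_ >= text_.size()) throw std::runtime_error("unexpected end of input");
    if (text_[pos_] == '(') {
      ++pos_;
      AstPtr inner = ParseSum();
      Expect(')');
      return inner;
    }
    const std::size_t start = pos_;
    while (pos_ < text_.size() &&
           (std::isalnum(static_cast<unsigned char>(text_[pos_])) ||
            text_[pos_] == '_' || text_[pos_] == '.')) {
      ++pos_;
    }
    if (pos_ == start) throw std::runtime_error("unexpected character");
    auto node = std::make_unique<Ast>();
    node->label = text_.substr(start, pos_ - start);
    SkipSpaces();
    if (pos_ < text_.size() && text_[pos_] == '(') {  // function call with >= 1 argument
      ++pos_;
      while (true) {
        node->args.push_back(ParseSum());
        SkipSpaces();
        if (pos_ < text_.size() && text_[pos_] == ',') {
          ++pos_;
          continue;
        }
        break;
      }
      Expect(')');
    }
    return node;
  }

  void Expect(char c) {
    SkipSpaces();
    if (pos_ >= text_.size() || text_[pos_] != c) {
      throw std::runtime_error(std::string("expected '") + c + "'");
    }
    ++pos_;
  }

  void SkipSpaces() {
    while (pos_ < text_.size() && std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
  }

  std::string text_;
  std::size_t pos_ = 0;
};

// Renders the tree as an s-expression, which makes the structure easy to inspect.
std::string ToString(const Ast& node) {
  if (node.args.empty()) return node.label;
  std::string out = "(" + node.label;
  for (const auto& arg : node.args) out += " " + ToString(*arg);
  return out + ")";
}
```

With this sketch, `ToString(*ToyParser("if(z, castFloat8(x), y * 1000.0)").Parse())` yields `(if z (castFloat8 x) (* y 1000.0))`. Because the tree is built from a plain string, the expression can come from a config file reloaded at runtime rather than from code compiled into the binary.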

Syntax

The goal is to design a succinct and intuitive grammar for both schema and Gandiva expressions. We will need a corresponding grammar for each Node type in Gandiva.

• Literals: We find Rust’s literal representation (https://doc.rust-lang.org/rust-by-example/types/literals.html) very intuitive. We’ll support suffixes such as i32, u64, f32 to denote a literal node’s type. The types of unsuffixed literals are inferred by their usage. Otherwise, the default type for integers is int32 and float32 for floating points. String and binary literals are wrapped with single or double quotes. Decimal128 literals will not be supported in the first version.
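The suffix rule for numeric literals can be sketched as a small classifier. This is only an illustration under stated assumptions: `TypedLiteral` and `ClassifyNumericLiteral` are hypothetical names, the full suffix table (i8 through f64) is an assumption extrapolated from the three suffixes named above, and usage-based inference for unsuffixed literals is skipped, so unsuffixed tokens fall straight through to the stated defaults (int32 and float32).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: the literal text without its suffix, plus a type name.
struct TypedLiteral {
  std::string digits;
  std::string type_name;
};

struct SuffixEntry {
  std::string_view suffix;
  std::string_view type_name;
};

// Assumed suffix table, modelled on Rust's literal suffixes.
constexpr SuffixEntry kSuffixes[] = {
    {"i8", "int8"},   {"i16", "int16"},   {"i32", "int32"},   {"i64", "int64"},
    {"u8", "uint8"},  {"u16", "uint16"},  {"u32", "uint32"},  {"u64", "uint64"},
    {"f32", "float32"}, {"f64", "float64"},
};

// Returns std::nullopt when the token is not a valid numeric literal.
std::optional<TypedLiteral> ClassifyNumericLiteral(std::string_view token) {
  std::string_view digits = token;
  std::string type_name;
  for (const auto& entry : kSuffixes) {
    const std::size_t n = entry.suffix.size();
    if (token.size() > n && token.substr(token.size() - n) == entry.suffix) {
      digits = token.substr(0, token.size() - n);
      type_name = std::string(entry.type_name);
      break;
    }
  }

  // Accept an optional leading '-', decimal digits, and at most one '.'.
  bool seen_digit = false;
  bool seen_dot = false;
  for (std::size_t i = 0; i < digits.size(); ++i) {
    const char c = digits[i];
    if (c == '-' && i == 0) continue;
    if (c == '.' && !seen_dot) {
      seen_dot = true;
      continue;
    }
    if (c < '0' || c > '9') return std::nullopt;
    seen_digit = true;
  }
  if (!seen_digit) return std::nullopt;

  if (type_name.empty()) {
    // No suffix: fall back to the defaults stated in the proposal.
    type_name = seen_dot ? "float32" : "int32";
  } else if (seen_dot && type_name[0] != 'f') {
    return std::nullopt;  // e.g. "1.5i32": fractional digits with an integer suffix
  }
  return TypedLiteral{std::string(digits), type_name};
}
```

For example, `ClassifyNumericLiteral("42u64")` gives digits `42` with type `uint64`, while the unsuffixed `1000.0` from the expression above falls back to `float32`. A real implementation would first try to infer the type from the surrounding expression before applying that default.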

Reporter: Jin Shang / @js8544

Subtasks:

Note: This issue was originally created as ARROW-17668. Please see the migration documentation for further details.

asfimport commented 2 years ago

Kouhei Sutou / @kou: Could you post this proposal to dev@arrow.apache.org to get more feedback?

asfimport commented 1 year ago

Apache Arrow JIRA Bot: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.