facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/

Adapt leafCallToSubfieldFilter to Spark #11093

Open rui-mo opened 1 month ago

rui-mo commented 1 month ago

Description

Gluten implemented some logic to convert a call expression into a subfield filter. To avoid duplication, we would like to use the existing 'leafCallToSubfieldFilter' logic in Velox. One incompatibility we found is that the current implementation matches against Presto function names, while Spark function names can be different. The other is that 'isnotnull', which is frequently used in Spark, is not currently supported. These are the drafted changes we would like to propose: https://github.com/rui-mo/velox/commit/73409033e5760a5c664c57adc282095910f257f6.

Yuhta commented 1 month ago

Yeah, leafCallToSubfieldFilter is Presto specific (similar to Tokenizer). Mixing in Spark names is probably not a good way to do it. A related question is how we parse this expression in ExprCompiler: do we have an isnotnull function in sparksql?

Yuhta commented 1 month ago

One way to fix it is to have

class ExprToSubfieldFilterParser {
  std::unique_ptr<common::Filter> leafCallToSubfieldFilter(
    const core::CallTypedExpr&,
    common::Subfield&,
    core::ExpressionEvaluator*,
    bool negated = false);
};

And have this stored as shared_ptr inside ExecCtx, or another context object with query scope.

rui-mo commented 3 weeks ago

@Yuhta Thanks for your suggestion. I took a further look and these are my findings.

To pass the instance of 'ExprToSubfieldFilterParser' through ctx, the workflow would be:

  1. In Gluten, we provide a customized instance to the queryCtx. In Velox, get the queryCtx from the task in TableScan.
  2. Pass the instance through 'createDataSource'.
  3. In HiveDataSource, pass the instance to 'extractFiltersFromRemainingFilter' and 'MetadataFilter'.

Another option I can think of is to provide a parser factory which allows the registration of a customized filter parser. This approach would be lighter-weight than the first, as it does not require an API change. What are your thoughts? Thanks!

Yuhta commented 3 weeks ago

Yes, we can use a global factory for now, until someone wants to have multiple engines in the same process (unlikely in the short term).