apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.87k stars 1.11k forks source link

Review use of logical expressions in physical AggregateFunctionExpr #11359

Closed andygrove closed 4 weeks ago

andygrove commented 1 month ago

Is your feature request related to a problem or challenge?

DataFusion 40.0.0 added a new logical_args: Vec<Expr> field to AggregateFunctionExpr, which seems confusing, and there is no documentation in this struct that explains what this field is used for.

In DataFusion Comet, we do not use DataFusion's logical plan or expressions because we are translating an Apache Spark physical plan into a DataFusion physical plan and therefore we have no logical expressions to pass into this new field.

I think at a minimum we should add some documentation around this new feature.

jayzhan211 commented 1 month ago

@ozankabak had asked for whether there is anyway to entirely getting rid of logical expressions in discord, so I think we can review about the challenge I had before.

The reason why there are logical expressions in creeate_aggregate_expr is for customizing Accumulator. We can have different kind of accumulator based on the function's arguments. Ideally we should check with PhysicalExpr, but unfortunately, given the current design, we are not able to introduce physical expressions in AggregateUDFImpl trait, because we want to avoid adding dependency of physical-expr crate to datafusion-expr.

I think that it is also the main reason that blocking us. I propose to redesign about the role of each crate. To able to deal with physical concept for AggregateUDFImpl, we break physical-expr into two level. One is physical-expr-common, and another is physical-expr. The same applies to logical expr, datafusion-expr-common and datafusion-expr. common crate is the higher level crate so that we can add the dependency of physical-expr-common into datafusion-expr

The crate graph is like

  graph TD;
      functions-aggregates-common-->expr-common;
      functions-aggregates-common-->physical-expr-common;
      functions-aggregates-->functions-aggregates-common;
      functions-aggregates-->expr;
      physical-expr-common-->expr-common;
      expr-->expr-common;
      expr-->functions-aggregates-common;
      physical-expr-->physical-expr-common;
      core-->functions-aggregates;
      core-->physical-expr;
      third-parties-aggregate --> functions-aggregates-common;

expr-common: Things other than Expr and LogicalPlan can place it here expr: Mainly for Expr and LogicalPlan. Import functions-aggregate-common for UDAF. physical-common: Physical expr trait or other common things. Similar to what it is now except physical-expr like Column, Cast, Literal which should be moved to physical-expr physical-expr: Physical Expr are here + Column, Cast, Literal functions-aggregate-common: Import physical-common and expr-common, for other users to build their own udaf. functions-aggregate: datafusion builtin functions

The more detail of discussion before is in https://github.com/apache/datafusion/issues/10074

With this approach, function like limited_convert_logical_expr_to_physical_expr is no longer needed after this change.

@alamb I think we can review about this idea again, the previous concern is that

What I am worried about is that physical-expr-common would end up with all the code from physical-expr

I think it is not an issue anymore.

ozankabak commented 1 month ago

With this approach, function like limited_convert_logical_expr_to_physical_expr is no longer needed after this change.

We have at least two such functions and it would be great to arrive at a design that eliminates such functions.

Conceptually, we should have more info at the physical level (relative to logical level), so a refactor that makes AggregateFunctionExpr and likewise structs independent of logicals should be possible. I get the feeling that our dependency structure is currently more complex than it needs to be.

jayzhan211 commented 1 month ago

I get the feeling that our dependency structure is currently more complex than it needs to be.

I think we tend to split crate aggressively so others can only import the necessary crate they need.

ozankabak commented 1 month ago

Splitting crates when beneficial is great, maybe we haven't arrived at the best design in terms of structure yet (as evidenced by this issue).

I think we can use your previous work as a starting point to improve the structure and also resolve this current issue 🚀

alamb commented 1 month ago

I agree with @jayzhan211 that the core of the problem is that the user defined API for aggregates is in datafusion_expr so can only use Expr but is invoked / instantiated as part of the physical plan.

However, the same basic problem could be claimed for ScalarUDFImpl and WindowUDFImpl 🤔

It does feel like the way out of this is to make the spit between API and implementation more explicit.

Maybe instead of expr-common maybe we could call it expr-api and physical-common --> physical-api

So like

alamb commented 1 month ago

I agree we should document what the fields are used for now

I personally recommend we finish #8708 before we try to do some other crate refactor. We are close with that one and once we have all the aggregates going through the same APIs I think we'll be in a better position to split things apart

jayzhan211 commented 1 month ago

Cool, maybe I could think about pulling down functions trait from expr instead of pulling up common things to expr-common 🤔

jayzhan211 commented 1 month ago

I think it is possible to have expr-api crate that contains ContextProvider, TableSource, FunctionRegistry, AggregateUDF, WindowUDF, and maybe ScalarUDF (this is not necessary but it would be nice to be close to other two functions)

physical-api: Physical expr trait or other common things. Similar to what it is now except physical-expr like Column, Cast, Literal which should be moved to physical-expr physical-expr-common Physical Expr are here + Column, Cast, Literal

Not sure about the reason for physical-expr-common crate that contains Physical Expr, that is almost 60 ~ 80 % of the things in physical-expr, I perfer moving them back 🤔

physical-api / physical-expr-common for PhysicalExpr and PhysicalSortExpr and utils function for physical expr.

The original idea of functions-aggregate-common is crate for AggregateUDFImpl, we can

  1. Keep ContextProvider, FunctionRegistry, TableSource and UDF,UDAF,UDWF in the same crate. The dependency is expr-api -> expr
  2. UDF,UDAF,UDWF in expr-functions and move ContextProvider, FunctionRegistry and TableSource to a crate like expr-provider / expr-registry. The dependency is expr-provider -> expr-functions -> expr

The crate graph is like

  graph TD;
      expr-api --> physical-expr-common;
      physical-expr-common --> expr;
      functions-aggregate -->  expr-api;
      third-parties-aggregate --> expr-api;

or

  graph TD;
      expr-provider --> expr-functions;
      expr-functions --> physical-expr-common;
      physical-expr-common --> expr;
      functions-aggregate -->  expr-functions;
      third-parties-aggregate --> expr-functions;
jayzhan211 commented 1 month ago

~I plan to pull out aggregate function to functions-aggregate-common (the role of expr-functions in the above graph but narrow to aggregate only) as the first step~. Ok, this doesn't work 😆 , AggregateFunction and Expr should belong to the same crate.

Expr::AggregateFunction(AggregateFunction)

pub struct AggregateFunction {
    /// Name of the function
    pub func: Arc<crate::AggregateUDF>,
   ..
}

pub struct AggregateUDF {
    inner: Arc<dyn AggregateUDFImpl>,
}

I guess #10327 is the only possible solution

jayzhan211 commented 1 month ago

I found AggregateUDFImpl is now tightly coupled with Expr unlike the status I got in #10327, thus not possible to eliminate the dependency of Expr. Mainly due to fn call() -> Expr

The crate graph is now like

Since we would like to import PhysicalExpr for AggregateUDFImpl, and sadly it seems to be tightly coupled with Expr.

Therefore, we come out several common-level crate, those crate has no dependencies on Expr. physical-expr-common are common things about physical-expr concept. Mainly PhysicalExpr, PhysicalSortExpr. Their dependencies are pull out from expr to expr-common. functions-aggregate-common are more aggregate-specific

Physical expr like Column, Cast, Literal are moved back to physical-expr for now. AggregateFunctionExpr, AggregateExprBuilder are moved to physical-expr-functions-aggregate, they have depdency on Expr. It's level is similar to physical-expr, but more aggregate specific.

  graph TD;
      physical-expr-common --> expr-common;
      functions-aggregate-common --> physical-expr-common;
      expr --> functions-aggregate-common;
      physical-expr-functions-aggregate --> expr;
      physical-expr --> expr;
      functions-aggregate --> physical-expr;