apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.54k stars 1.03k forks source link

[Epic] Unify `AggregateFunction` Interface (remove built in list of `AggregateFunction` s), improve the system #8708

Open alamb opened 6 months ago

alamb commented 6 months ago

Is your feature request related to a problem or challenge?

For many of the same reasons as listed on https://github.com/apache/arrow-datafusion/issues/8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:

  1. There are some features not available to User Defined Aggregate Functions (such as the faster GroupsAccumulator interface)
  2. Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions

The second also ends up causing pushback on adding new aggregates like ARRAY_SUM in https://github.com/apache/arrow-datafusion/pull/8325 and geospatial support https://github.com/apache/arrow-datafusion/issues/7859.

Describe the solution you'd like

I propose moving DataFusion to only use AggregateUDFs and remove the built in list of AggregateFunctions for the same reasons as https://github.com/apache/arrow-datafusion/issues/8045

We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.

New AggregateUDF is in functions-aggregate crate Old Aggregate functions are in datafusion/physical-expr/src/aggregate

Describe alternatives you've considered

Additional context

Proposed implementation steps:

Move rust test to sqllogictest if possible #10384

Good first issue list

Pending

impl FromStr for AggregateFunction {
    type Err = DataFusionError;
    fn from_str(name: &str) -> Result<AggregateFunction> {
        Ok(match name {
            // general
            "avg" => AggregateFunction::Avg,
            "bit_and" => AggregateFunction::BitAnd,
            "bit_or" => AggregateFunction::BitOr,
            "bit_xor" => AggregateFunction::BitXor,
            "bool_and" => AggregateFunction::BoolAnd,
            "bool_or" => AggregateFunction::BoolOr,
            "max" => AggregateFunction::Max,
            "mean" => AggregateFunction::Avg,
            "min" => AggregateFunction::Min,
            "array_agg" => AggregateFunction::ArrayAgg,
            "nth_value" => AggregateFunction::NthValue,
            "string_agg" => AggregateFunction::StringAgg,
            // statistical
            "corr" => AggregateFunction::Correlation,
            // other
            "grouping" => AggregateFunction::Grouping,
            _ => {
                return plan_err!("There is no built-in function named {name}");
            }
        })
    }
}

Feel free to file an issue if you are interested in working on any of the above in the pending list.

jayzhan211 commented 3 months ago

I will work on FirstValue UDF together with #9249

jayzhan211 commented 3 months ago

Can we introduce state_fields and fields for AggregateUDFImpl. We can see that types in AggregateUDFImpl are for building fields, why not just return fields directly, we can not only define the types but also the field name and is_nullalbe.

https://github.com/apache/arrow-datafusion/blob/e337832946fc69f6ceccd14b96a071cc2cd4693d/datafusion/physical-plan/src/udaf.rs#L86-L106

alamb commented 3 months ago

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

jayzhan211 commented 3 months ago

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

I leave the comment so I won't forget. Have a good trip 😊

jayzhan211 commented 3 months ago

I have another question. I think our goal for the aggregate function is similar to functions. we will move them into separate crate. If we need to avoid importing physical-expr. It seems we need to move some struct from physical-expr to expr, like PhysicalExpr and pub type LexOrdering = Vec<PhysicalSortExpr>;

https://github.com/apache/arrow-datafusion/blob/3eeb108125b35424baac39dd20ba88433b347419/datafusion/physical-expr/src/aggregate/first_last.rs#L44-L45

But, does moving PhysicalExpr to datafusion_expr make sense? 😕 Where/When should convert expr to physical-expr like sort_exprs to LexOrdering. Should we go through the whole process in the aggregate-functions crate like what functions do or we should separate logical-expr and physical-expr for aggregate-functions and find a way to link between them (convert from logical-expr to physical-expr)?

alamb commented 1 month ago

After https://github.com/apache/datafusion/pull/10648 and https://github.com/apache/datafusion/issues/10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )

Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?

jayzhan211 commented 1 month ago

After #10648 and #10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )

Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?

Before this, I think it would be nice to determine the expression API of aggregate function first https://github.com/apache/datafusion/pull/10560#discussion_r1611644593