Open alamb opened 11 months ago
I will work on FirstValue UDF together with #9249
Can we introduce `state_fields` and `fields` for `AggregateUDFImpl`? We can see that the types in `AggregateUDFImpl` are only used for building fields, so why not just return fields directly? That way we can define not only the types but also the field name and `is_nullable`.
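To make the suggestion concrete, here is a minimal sketch of what returning fields directly could look like. Everything in it is an assumption for illustration: `DataType` and `Field` are simplified stand-ins for the arrow types, `AggregateFields` is a hypothetical trait rather than the real `AggregateUDFImpl`, and the `is_set` state column is made up.

```rust
// Simplified stand-in for arrow's DataType (assumption, not the real enum).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Boolean,
}

// Simplified stand-in for arrow's Field: a name, a type, and nullability.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

// Hypothetical trait: the UDF returns whole fields, so it controls the
// name and nullability as well as the type.
trait AggregateFields {
    fn name(&self) -> &str;
    // Field describing the final aggregate value.
    fn field(&self, input: &Field) -> Field;
    // Fields describing the intermediate accumulator state.
    fn state_fields(&self, input: &Field) -> Vec<Field>;
}

struct FirstValue;

impl AggregateFields for FirstValue {
    fn name(&self) -> &str {
        "first_value"
    }

    fn field(&self, input: &Field) -> Field {
        // The result is nullable because the input may contain no rows.
        Field::new(self.name(), input.data_type.clone(), true)
    }

    fn state_fields(&self, input: &Field) -> Vec<Field> {
        vec![
            Field::new("first_value", input.data_type.clone(), true),
            // Hypothetical flag recording whether a value has been seen.
            Field::new("is_set", DataType::Boolean, false),
        ]
    }
}
```

With a shape like this, the caller no longer has to rebuild fields from a list of types, and each UDF decides its own state column names and nullability.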
Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week
I am leaving this comment so I won't forget. Have a good trip 😊
I have another question. I think our goal for the aggregate functions is similar to `functions`: we will move them into a separate crate. If we need to avoid importing `physical-expr`, it seems we need to move some structs from `physical-expr` to `expr`, like `PhysicalExpr` and `pub type LexOrdering = Vec<PhysicalSortExpr>;`.
But does moving `PhysicalExpr` to `datafusion_expr` make sense? 😕 Where and when should we convert `expr` to `physical-expr`, for example `sort_exprs` to `LexOrdering`? Should we go through the whole process in the `aggregate-functions` crate like `functions` does, or should we separate `logical-expr` and `physical-expr` for `aggregate-functions` and find a way to link them (convert from logical-expr to physical-expr)?
After https://github.com/apache/datafusion/pull/10648 and https://github.com/apache/datafusion/issues/10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations laid by @jayzhan211).
Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?
Before this, I think it would be nice to settle the expression API for aggregate functions first: https://github.com/apache/datafusion/pull/10560#discussion_r1611644593
Is your feature request related to a problem or challenge?
For many of the same reasons as listed on https://github.com/apache/arrow-datafusion/issues/8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:
`GroupsAccumulator` interface)
The second also ends up causing pushback on adding new aggregates like `ARRAY_SUM` in https://github.com/apache/arrow-datafusion/pull/8325 and geospatial support https://github.com/apache/arrow-datafusion/issues/7859.
Describe the solution you'd like
I propose moving DataFusion to only use `AggregateUDF`s and remove the built in list of AggregateFunctions for the same reasons as https://github.com/apache/arrow-datafusion/issues/8045
We will keep the existing `AggregateUDF` interface as much as possible, while also potentially providing an easier way to define them.
New AggregateUDF is in the `functions-aggregate` crate. Old aggregate functions are in `datafusion/physical-expr/src/aggregate`.
Describe alternatives you've considered
Additional context
Proposed implementation steps:
10695
9926
Move rust test to sqllogictest if possible #10384
Good first issue list
Pending
Feel free to file an issue if you are interested in working on any of the above in the pending list.