apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.21k stars 957 forks source link

UDAF: Add more fields to state fields #10391

Closed jayzhan211 closed 3 days ago

jayzhan211 commented 1 week ago

Which issue does this PR close?

Closes #.

I check the field and state_fields that are used for most of the aggregate functions. I think it is nice to split them out as a PR

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 commented 1 week ago

@alamb This change is based on what the built-in function does, and what we need for UDAF to have the same behavior. If the change is not clear if it is necessary, we can also change it when moving the function to UDAF gradually.

This PR exists because 1. breaking API changes at once, so we don't need to keep breaking API after that. 2. Split the change out can also reduce the review size of other functions moving (but less straightforward if the change is worth it)

if we found out there is a better way to rewrite what the current builtin function does, the change here might be useless, so I think both way has pros and cons

jayzhan211 commented 1 week ago

StateFieldsArgs DistinctArrayAgg needs nullable ~Sum needs return_type Most of the functions needs input_type.~

FieldArgs ~Avg needs return_type~ ~Most of the functions needs input_type.~ OrderSensitiveArrayAgg needs nullable and order_by_data_types

Upd: I was mislead by the naming, it seems we just need the value_type what we have now, which most of the functions are input_type, while functions like Sum or Count has a different type (return_type).

alamb commented 1 week ago

@alamb This change is based on what the built-in function does, and what we need for UDAF to have the same behavior. If the change is not clear if it is necessary, we can also change it when moving the function to UDAF gradually.

Ah, interesting, I didn't realize this. Sorry

I think it would be useful to port UDFs over one by one, and as we find gaps in the APIs we can adjust them

Or if we want to discover all the potential issues we could do a WIP PR (as you have done already) to try to move the aggregates and when we hit a block fix the underlying issues in a different PR