apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[Epic] A collection of issues for extending the Aggregation function #12254

Open Weijun-H opened 2 weeks ago

Weijun-H commented 2 weeks ago

Is your feature request related to a problem or challenge?

DataFusion now supports several aggregation functions, but it still lacks some common ones that are essential for a broader range of data processing tasks. To make DataFusion more versatile and capable of handling diverse workloads, it should include additional aggregation functions commonly used in data analysis, such as mode and max_by.
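For readers unfamiliar with the two functions named above, here is a minimal sketch of what `mode` and `max_by` compute, written in plain Rust over in-memory slices. It is illustrative only and is not DataFusion code: the function signatures, the `i64` value type, and the tie-breaking rules (smallest value wins for `mode`, last row wins for `max_by`) are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Returns the most frequent value in `values`.
/// Assumption for this sketch: ties are broken in favour of the smallest value.
fn mode(values: &[i64]) -> Option<i64> {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Higher count wins; on equal counts the smaller value wins.
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(value, _)| value)
}

/// Returns the `x` of the row whose `y` is largest, i.e. max_by(x, y).
/// Assumption for this sketch: on ties the last such row is returned.
fn max_by<'a>(rows: &[(&'a str, i64)]) -> Option<&'a str> {
    rows.iter().max_by_key(|(_, y)| *y).map(|(x, _)| *x)
}

fn main() {
    let values = [1, 2, 2, 3, 3];
    println!("mode = {:?}", mode(&values));

    let rows = [("a", 1), ("b", 5), ("c", 3)];
    println!("max_by = {:?}", max_by(&rows));
}
```

In SQL terms these would correspond to something like `SELECT mode(v) FROM t` and `SELECT max_by(x, y) FROM t`; the exact semantics in any eventual implementation may differ from this sketch.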

Describe the solution you'd like

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 1 week ago

I wonder if we should consider where to draw the line on what aggregate functions to include in the core (i.e. should we include all these new functions?)

Now that all aggregate functions use the same API, we could potentially keep more specialized functions such as those listed here outside the core -- either in their own crate or even their own repo -- and then have other code integrate them -- e.g. https://github.com/apache/datafusion/issues/11979
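To illustrate why a shared API makes this split possible, here is a rough sketch in plain Rust. It does not use DataFusion's real traits: `ToyAccumulator`, `ToyRegistry`, `Sum`, and `Range` are hypothetical stand-ins invented for this example. The point is only that a function defined outside the core can be registered and invoked exactly like a built-in one.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a shared aggregate API.
trait ToyAccumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> Option<f64>;
}

/// Factory that creates a fresh accumulator for each aggregation.
type AccumulatorFactory = fn() -> Box<dyn ToyAccumulator>;

/// Hypothetical registry mapping function names to accumulator factories.
#[derive(Default)]
struct ToyRegistry {
    factories: HashMap<String, AccumulatorFactory>,
}

impl ToyRegistry {
    fn register(&mut self, name: &str, factory: AccumulatorFactory) {
        self.factories.insert(name.to_string(), factory);
    }

    /// Runs the named aggregate over `values`; None if the name is unknown.
    fn aggregate(&self, name: &str, values: &[f64]) -> Option<f64> {
        let factory = *self.factories.get(name)?;
        let mut acc = factory();
        for &v in values {
            acc.update(v);
        }
        acc.evaluate()
    }
}

/// A "core" function: sum of all inputs (None for empty input).
#[derive(Default)]
struct Sum {
    total: f64,
    seen: bool,
}

impl ToyAccumulator for Sum {
    fn update(&mut self, value: f64) {
        self.total += value;
        self.seen = true;
    }
    fn evaluate(&self) -> Option<f64> {
        if self.seen { Some(self.total) } else { None }
    }
}

/// An "extension" function that could live in a separate crate: max - min.
#[derive(Default)]
struct Range {
    bounds: Option<(f64, f64)>,
}

impl ToyAccumulator for Range {
    fn update(&mut self, value: f64) {
        self.bounds = Some(match self.bounds {
            Some((lo, hi)) => (lo.min(value), hi.max(value)),
            None => (value, value),
        });
    }
    fn evaluate(&self) -> Option<f64> {
        self.bounds.map(|(lo, hi)| hi - lo)
    }
}

fn new_sum() -> Box<dyn ToyAccumulator> {
    Box::new(Sum::default())
}

fn new_range() -> Box<dyn ToyAccumulator> {
    Box::new(Range::default())
}

/// Registers the core function and the extension through the same call.
fn build_registry() -> ToyRegistry {
    let mut registry = ToyRegistry::default();
    registry.register("sum", new_sum); // core
    registry.register("range", new_range); // extension
    registry
}

fn main() {
    let registry = build_registry();
    println!("{:?}", registry.aggregate("sum", &[1.0, 2.0, 3.0]));
    println!("{:?}", registry.aggregate("range", &[3.0, 9.0, 5.0]));
}
```

The registry never needs to know whether a function ships with the core or comes from elsewhere, which is the property that makes a separate crate or repo workable.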

alamb commented 1 week ago

I started a discussion about whether we should add all these functions directly to the core here: https://github.com/apache/datafusion/issues/12357

Weijun-H commented 1 week ago

I wonder if we should consider where to draw the line on what aggregate functions to include in the core (i.e. should we include all these new functions?)

Now that all aggregate functions use the same API, we could potentially keep more specialized functions such as those listed here outside the core -- either in their own crate or even their own repo -- and then have other code integrate them -- e.g. #11979

I like this idea! 🚀

alamb commented 7 hours ago

@Weijun-H, @dmitrybugakov and @dharanad -- what do you think about creating a datafusion-functions-duckdb repo in datafusion-contrib, similar to https://github.com/datafusion-contrib/datafusion-functions-json for JSON from @samuelcolvin and co.?

It would be a pretty neat way to help build out the function library in DataFusion and would show off its extensibility

I could then try to integrate it into dft, which @matthewmturner and I have been working on: https://github.com/datafusion-contrib/datafusion-dft. That would make it easier to use

Originally from: https://github.com/apache/datafusion/pull/12476#issuecomment-2353611810

austin362667 commented 6 hours ago

Thank you @alamb for proposing this initiative. I like this idea. What do others think? It clearly draws a line between the core and the extensions, and we can still use those functions as extensions in dft.