apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python][Acero] Provide method to perform aggregations with acero for datasets #44168

Open sidneymau opened 1 month ago

sidneymau commented 1 month ago

Describe the enhancement requested

Presently, Dataset has methods that perform several operations with Acero: sort_by, join, and join_asof. It would be especially helpful to add a method that performs aggregations on datasets using Acero, for convenient out-of-core processing.

The implementation can be modeled on the existing Dataset Acero operations as well as the aggregate method of TableGroupBy.

Component(s)

Python

sidneymau commented 1 month ago

Note that the implementation proposed in the above PR ends up being fairly inefficient because it can't fully leverage Acero nodes for, e.g., projection and filtering. If there is interest, this functionality could be included, essentially providing a dataframe-like interface for constructing an Acero plan, as can be done with DataFusion, but that is somewhat larger in scope.