fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0
1.98k stars 94 forks source link

[FEATURE] Add coarse partitioning concept #449

Closed goodwanghan closed 1 year ago

goodwanghan commented 1 year ago

When a dataframe has large number of small partitions, for group-map operations, distributed backends are not always efficient because there is too much communication overhead between the customer logic and the backends.

So we should have a way to partition in a less granular way: for each user defined group it is in one and only one partition, but for each partition it can contain multiple groups.

With coarse partitioning, we push part of the partitioning responsibility to local computing frameworks such as pandas, arrow and polars,. This in some cases can be significantly more efficient.