[FEATURE] Spark Pandas UDF partitioning improvement

fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

https://fugue-tutorials.readthedocs.io/

Apache License 2.0

[FEATURE] Spark Pandas UDF partitioning improvement #493

Closed: goodwanghan closed this issue 1 year ago

goodwanghan commented 1 year ago

Pandas UDF partitioning currently has two limitations:

- groupBy doesn't accept a target number of partitions (as repartition does), even though setting that number is an effective way to control parallelism.
- Even repartitioning (the balanced partitioning that Fugue supports) can't use Pandas UDF.

But there now seems to be a solution: if we repartition before applying the Pandas UDF, Spark respects the resulting partitions. We will use this trick to enable more Pandas UDF scenarios.
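To make the idea concrete, here is a minimal PySpark sketch of the repartition-then-group pattern described above. It is not Fugue's implementation: `repartition_then_apply`, `demo`, the sample data, and the `center` function are all hypothetical names introduced for illustration. The sketch assumes Spark 3.0+ (for `applyInPandas`), and `demo` additionally assumes that pyspark, pandas, and pyarrow are installed.

```python
from typing import Any, Callable, Sequence


def repartition_then_apply(
    df: Any,
    keys: Sequence[str],
    num_partitions: int,
    func: Callable,
    schema: str,
) -> Any:
    """Hypothetical helper (not part of Fugue or Spark).

    Repartitions `df` by `keys` into `num_partitions` partitions, then runs
    a grouped Pandas UDF over the same keys.
    """
    # DataFrame.repartition(n, *cols) hash-partitions rows by the key columns
    # into exactly n partitions, which gives explicit control over parallelism.
    partitioned = df.repartition(num_partitions, *keys)
    # Grouping on the same keys afterwards; per the observation in this issue,
    # Spark is expected to respect the partitions created above.
    return partitioned.groupBy(*keys).applyInPandas(func, schema)


def demo() -> None:
    """Illustrative usage; requires pyspark, pandas, and pyarrow."""
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame(
        [(k, float(v)) for k in ("a", "b", "c") for v in range(3)],
        ["key", "value"],
    )

    def center(pdf: pd.DataFrame) -> pd.DataFrame:
        # Subtract the per-group mean; each call receives one whole group.
        return pdf.assign(value=pdf["value"] - pdf["value"].mean())

    result = repartition_then_apply(
        df, ["key"], 2, center, "key string, value double"
    )
    result.show()
    spark.stop()
```

Calling `demo()` runs the pattern on a local session. To check whether a second shuffle is avoided in a given Spark version, inspecting `result.explain()` for a single exchange is one option.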