There are limitations on Pandas UDF partitioning

`groupBy` doesn't support a specified number of partitions (unlike `repartition`), and that number is an effective way to control parallelism. Even repartitioning (supported by Fugue) can't be used with Pandas UDFs directly. But now there seems to be a solution: if we call `repartition` before applying a Pandas UDF, Spark respects those partitions. We will use this trick to enable more scenarios for Pandas UDFs.