[X] I searched in the issues and found nothing similar.
Paimon version
0.8
Compute Engine
spark
Minimal reproduce step
When writing to a primary-key table in fixed-bucket mode, we set 1000 buckets and expect the data to be processed evenly by 1000 partitions. Instead, some tasks process no data while others process a large amount, which lengthens the overall running time of the job.
We found that the root cause is that repartitionByExpression hashes the bucket column again, which results in an uneven distribution of data across partitions.
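The effect can be sketched outside Spark. The snippet below is only an illustration under stated assumptions: `stand_in_hash` is a hypothetical SHA-256 based stand-in, not Spark's actual hash function, and all names are made up for this example. It compares hashing the bucket id a second time against using the bucket id directly as the partition index.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 1000      # fixed bucket count of the table, as in this report
NUM_PARTITIONS = 1000   # number of shuffle partitions requested


def stand_in_hash(value: int) -> int:
    """Hypothetical stand-in for the engine's hash (NOT Spark's real one)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Each row already carries a bucket id in [0, NUM_BUCKETS).
# Hashing that id again scatters buckets pseudo-randomly over partitions,
# so several buckets can collide in one partition while others stay empty.
rehashed = Counter(stand_in_hash(b) % NUM_PARTITIONS for b in range(NUM_BUCKETS))
empty_after_rehash = NUM_PARTITIONS - len(rehashed)
max_buckets_per_task = max(rehashed.values())

# Using the bucket id itself as the partition index keeps one bucket per task.
direct = Counter(b % NUM_PARTITIONS for b in range(NUM_BUCKETS))
empty_direct = NUM_PARTITIONS - len(direct)

print(f"re-hashed: {empty_after_rehash} empty partitions, "
      f"up to {max_buckets_per_task} buckets in one task")
print(f"direct:    {empty_direct} empty partitions, "
      f"up to {max(direct.values())} bucket(s) in one task")
```

With any well-mixed hash, assigning 1000 bucket ids to 1000 partitions behaves like throwing balls into bins at random, so roughly a third of the partitions (about 1/e) are expected to receive nothing. That matches the empty tasks described above.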
Some of our configs:
What doesn't meet your expectations?
The data distribution across partitions. We expected the data to be spread evenly over the 1000 partitions.
Anything else?
No response
Are you willing to submit a PR?