apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] spark fixed bucket write causes task data process unevenly for pk table #3651

Open askwang opened 3 months ago

askwang commented 3 months ago


Paimon version

0.8

Compute Engine

spark

Minimal reproduce step

When we use fixed bucket mode to write data to a pk table, we set 1000 buckets and expect the data to be processed evenly across 1000 partitions. However, we found that some tasks process no data while others process a large amount of data. This makes the overall running time of the job longer.

We found that the root cause is that `repartitionByExpression` performs another hash calculation on the bucket column, which results in an uneven distribution of data across partitions.
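To illustrate the effect (this is a standalone simulation, not Paimon or Spark code; the hash function is a stand-in for the second hash applied during repartitioning): when 1000 distinct bucket ids are hashed again into 1000 shuffle partitions, collisions leave a large fraction of partitions empty while others receive several buckets.

```python
# Simulation: re-hashing N bucket ids into N partitions leaves
# roughly 1/e (~37%) of partitions empty, instead of a 1:1 mapping.
import hashlib

NUM_BUCKETS = 1000
NUM_PARTITIONS = 1000  # mirrors spark.sql.shuffle.partitions

def rehash(bucket_id: int) -> int:
    # Stand-in for the second hash applied by repartitionByExpression.
    digest = hashlib.md5(str(bucket_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

counts = [0] * NUM_PARTITIONS
for b in range(NUM_BUCKETS):
    counts[rehash(b)] += 1

print("empty partitions:", counts.count(0), "of", NUM_PARTITIONS)
print("max buckets in one partition:", max(counts))
```

Tasks assigned an empty partition finish immediately, while tasks that receive multiple buckets dominate the job's running time, which matches the skew described above.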

Some of our configs:

```
primary-key = uuid
bucket = 1000
spark.sql.shuffle.partitions = 1000
```

(Screenshot: uneven data distribution across Spark tasks)

What doesn't meet your expectations?

The data distribution across partitions is uneven.

Anything else?

No response

Are you willing to submit a PR?