apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Reduce redundant shuffle for spark dynamic bucket writes #3222

Open — wForget opened 7 months ago

Motivation

Dynamic bucket writing performs two shuffles. The first, repartitionByKeyPartitionHash, seems unnecessary: it appears to be used only to determine assignId. However, assignId can be computed from partitionHash, keyHash, numParallelism, and numAssigners, so the extra shuffle should not be needed. Can we remove it?

https://github.com/apache/paimon/blob/e27ceb464244f5a0c2bfa2a7c6db649ca945212b/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/commands/PaimonSparkWriter.scala#L143
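The idea above can be sketched as follows. If assignId really is a pure function of the two hashes plus the static parallelism parameters, every writer task can compute it locally, so no preliminary shuffle by (partition, key) hash is required. The method name and the exact formula below are hypothetical, for illustration only, not Paimon's actual implementation:

```java
// Hypothetical sketch: derive an assigner id purely from locally
// available hashes and static job parameters. Not Paimon's real code;
// it only illustrates why such a function would make the first
// shuffle redundant.
public class AssignIdSketch {

    // Map a record to one of numParallelism tasks, spreading the keys of
    // one partition over numAssigners of them. Math.floorMod keeps the
    // result non-negative even for negative hashes.
    static int assignId(int partitionHash, int keyHash,
                        int numParallelism, int numAssigners) {
        int start = Math.floorMod(partitionHash, numParallelism);
        int offset = Math.floorMod(keyHash, numAssigners);
        return (start + offset) % numParallelism;
    }

    public static void main(String[] args) {
        // Deterministic: the same inputs always map to the same task,
        // on any worker, with no coordination needed.
        System.out.println(assignId(42, 7, 8, 4));
    }
}
```

Because the mapping is deterministic, the record-to-assigner routing could be folded into the single write-side shuffle instead of requiring its own repartition step.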

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

wForget commented 7 months ago

@YannByron could you please take a look?

JingsongLi commented 6 months ago

It is hard. Perhaps different assigners would end up producing data for the same bucket.
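One way to read this objection: if dynamic bucket assignment is stateful (an assigner remembers which keys it has seen and opens new buckets as existing ones fill), two independently running assigners can diverge on the same key, so the routing cannot be a pure local computation. The sketch below is illustrative only, not Paimon's code; class and field names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful dynamic bucket assigner: each instance fills a
// bucket up to `capacity` distinct keys, then opens the next bucket.
// Illustrates why two independent assigners can disagree.
public class DivergingAssigners {

    static class Assigner {
        final int capacity;
        final Map<Integer, Integer> keyToBucket = new HashMap<>();
        int currentBucket = 0;
        int currentCount = 0;

        Assigner(int capacity) { this.capacity = capacity; }

        int assign(int keyHash) {
            Integer existing = keyToBucket.get(keyHash);
            if (existing != null) {
                return existing;          // known key: bucket must stay stable
            }
            if (currentCount == capacity) {
                currentBucket++;          // current bucket full: open a new one
                currentCount = 0;
            }
            currentCount++;
            keyToBucket.put(keyHash, currentBucket);
            return currentBucket;
        }
    }

    public static void main(String[] args) {
        Assigner a = new Assigner(2);
        Assigner b = new Assigner(2);
        // Assigner a has already absorbed two other keys; b has seen none.
        a.assign(100);
        a.assign(200);
        // The same key now lands in different buckets on the two assigners,
        // so routing it to either one arbitrarily would corrupt the index.
        System.out.println(a.assign(300));
        System.out.println(b.assign(300));
    }
}
```

This is why the pre-shuffle exists: it guarantees that all records with the same (partition, key) hash reach the one assigner that owns that key's state.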