apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0
2.07k stars 904 forks source link

[Improvement] Support local sort in the Spark's insertRepartitionBeforeWrite optimization rule #5057

Open touchida opened 1 year ago

touchida commented 1 year ago

Code of Conduct

Search before asking

What would you like to be improved?

Currently, the Spark's insertRepartitionBeforeWrite optimization rule will be skipped when logical plans are Sort regardless of whether they are local or not: https://github.com/apache/kyuubi/blob/fa9e6be/extensions/spark/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/RepartitionBeforeWritingBase.scala#L133. It makes sense for global sort, since inserting repartition after the sort changes the semantics of the original plans and doing before that only introduces an additional shuffle. However, inserting repartition before local sort will help to sort rebalanced partitions even if locally, and it aligns with the behavior of queries that explicitly use both REPARTITION|REBALANCE and SORT BY.

How should we improve?

This issue proposes to support local sort in the Spark's insertRepartitionBeforeWrite optimization rule by inserting repartition before the sort.

Are you willing to submit PR?

pan3793 commented 1 year ago

cc @ulysses-you @cxzl25 @cfmcgrady what do you think about this idea?