apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[spark] Optimize sort compact procedure by submitting job by partition #3613

Closed Zouxxyy closed 2 days ago

Zouxxyy commented 6 days ago

Purpose

Currently, Spark sort compact sorts all the filtered data globally and then writes it back with dynamic partition overwrite, which leads to the following problems:

  1. If no partition is specified, data across partitions is sorted globally, which is unnecessary overhead.
  2. Because it is a full-range shuffle, the number of small files is uncontrollable, since every reducer may contain data from every partition.

Therefore, this PR makes each partition a sort compact group and submits it to Spark as a separate job. A new conf ~max_order_threads~ max_concurrent_jobs is introduced in the sort compact procedure to control the maximum number of concurrent job submissions; the default is 15.
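
For illustration, here is a minimal Scala sketch of the submission pattern described above, not the PR's actual code: each partition becomes its own sort compact group, and a fixed-size thread pool caps how many Spark jobs are in flight at once. The object name `SortCompactByPartition` and the `compactOnePartition` callback are hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object SortCompactByPartition {

  // Sketch only: submit one sort compact job per partition, bounding the number
  // of concurrently submitted jobs with a fixed-size thread pool (default 15).
  def run(
      partitions: Seq[String],
      compactOnePartition: String => Unit, // hypothetical: sort + overwrite one partition
      maxConcurrentJobs: Int = 15): Unit = {
    val pool = Executors.newFixedThreadPool(maxConcurrentJobs)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val jobs = partitions.map { partition =>
      Future {
        // The sort/shuffle now covers only one partition's data, so reducers
        // cannot mix files from different partitions and small files are
        // easier to control.
        compactOnePartition(partition)
      }
    }

    try Await.result(Future.sequence(jobs), Duration.Inf)
    finally pool.shutdown()
  }
}
```

The thread pool is what the new max_concurrent_jobs conf corresponds to conceptually: it trades off cluster utilization against the number of Spark jobs queued at the same time.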

Tests

Tested on 1 TB TPC-DS: sort compacting web_sales on ws_bill_customer_sk took 233s before this change and 108s after, and the target file size can be controlled by spark.sql.shuffle.partitions.
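
For reference, a hedged usage sketch of running a sort compact like the one benchmarked above from Spark. The `sys.compact` procedure with `order_strategy` and `order_by` is Paimon's existing Spark procedure; the catalog/table names, the `options` argument, and passing `max_concurrent_jobs` through it are assumptions for illustration, so check the documentation for the exact option name.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Spark session already configured with the Paimon catalog;
// the database and table names below are placeholders.
val spark = SparkSession.builder()
  .appName("paimon-sort-compact-example")
  .getOrCreate()

// The per-partition sort is a shuffle, so shuffle parallelism effectively
// controls the target file size of the rewritten files.
spark.conf.set("spark.sql.shuffle.partitions", "200")

// Invoke the sort compact procedure. Passing max_concurrent_jobs via `options`
// is an assumption based on this PR, not a confirmed parameter name.
spark.sql(
  """CALL sys.compact(
    |  table => 'tpcds.web_sales',
    |  order_strategy => 'order',
    |  order_by => 'ws_bill_customer_sk',
    |  options => 'max_concurrent_jobs=15'
    |)""".stripMargin)
```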

API and Format

Documentation