apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[GLUTEN-7028][CH][Part-8] Support one pipeline write for partition mergetree #7924

Closed baibaichen closed 1 week ago

baibaichen commented 1 week ago

What changes were proposed in this pull request?

(Fixes: #7028) The following diagram shows the current class hierarchy. SparkPartitionedBaseSink inherits from ClickHouse's DB::PartitionedSink:

WriteStatsBase
  |- MergeTreeStats  <--- collect stats at finish  -----------------------|
  |- WriteStats      <--- collect stats at consume ---|                   |
                                                      |                   |
SparkPartitionedBaseSink                              |                   |
  |- SubstraitPartitionedFileSink      ---create --> SubstraitFileSink    |
  |- SparkMergeTreePartitionedFileSink ---create --> SparkMergeTreeSink --|
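To make the relationships in the diagram concrete, here is a minimal, self-contained sketch of the pattern it describes: a partitioned sink that creates one child sink per partition, with the child sinks reporting to a shared stats object when they finish. Every name here (`ToyStats`, `ToyPartitionSink`, `ToyPartitionedSink`) is hypothetical and row counts stand in for real blocks; this is not the Gluten or ClickHouse code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a stats collector that is filled when sinks finish.
struct ToyStats {
    std::map<std::string, std::size_t> rows_per_partition;
};

// Hypothetical per-partition sink: accumulates a row count and reports it on finish.
class ToyPartitionSink {
public:
    ToyPartitionSink(std::string key, ToyStats& stats)
        : key_(std::move(key)), stats_(stats) {}

    void consume(std::size_t rows) { rows_ += rows; }

    // Stats are published only here, mirroring "collect stats at finish".
    void finish() { stats_.rows_per_partition[key_] = rows_; }

private:
    std::string key_;
    ToyStats& stats_;
    std::size_t rows_ = 0;
};

// Hypothetical partitioned sink: creates one child sink per partition key on first use.
class ToyPartitionedSink {
public:
    explicit ToyPartitionedSink(ToyStats& stats) : stats_(stats) {}

    void consume(const std::string& key, std::size_t rows) {
        assert(!finished_ && "consume() called after finish()");
        auto it = sinks_.find(key);
        if (it == sinks_.end())
            it = sinks_.emplace(key, std::make_unique<ToyPartitionSink>(key, stats_)).first;
        it->second->consume(rows);
    }

    // Finishing the parent finishes every child sink, which publishes its stats.
    void finish() {
        for (auto& kv : sinks_) kv.second->finish();
        finished_ = true;
    }

    std::size_t sinkCount() const { return sinks_.size(); }

private:
    ToyStats& stats_;
    std::map<std::string, std::unique_ptr<ToyPartitionSink>> sinks_;
    bool finished_ = false;
};
```

In this sketch, nothing appears in `ToyStats` until `finish()` is called on the parent sink, which is the "collect stats at finish" arrow in the diagram.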

In pipeline write, the partitioned MergeTree looks like this. It squashes blocks over the whole input before partitioning:

  // Spark 3.5
  Input pipeline
    => PlanSquashingTransform
      => ApplySquashingTransform
        => SparkMergeTreePartitionedFileSink
          => SparkMergeTreeSink
          => SparkMergeTreeSink
          => ...
        => MergeTreeStats

This differs from Spark 3.3, which squashes blocks after partitioning, separately for each partition, because partitioning is triggered by the JVM.

The new implementation is the same as ClickHouse's.
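The difference between the two orderings can be illustrated with a small toy model. The sketch below is not Gluten or ClickHouse code: `Block`, `squash`, `partition`, `squashThenPartition`, and `partitionThenSquash` are hypothetical names, a block is reduced to a list of partition keys, and the squashing rule (emit once at least `min_rows` rows are buffered) is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a block: each row is represented only by its partition key.
using Block = std::vector<std::string>;

// Merge consecutive blocks until at least `min_rows` rows are buffered, then emit.
// The final block may be smaller than `min_rows`.
std::vector<Block> squash(const std::vector<Block>& input, std::size_t min_rows) {
    assert(min_rows > 0);
    std::vector<Block> out;
    Block pending;
    for (const Block& b : input) {
        pending.insert(pending.end(), b.begin(), b.end());
        if (pending.size() >= min_rows) {
            out.push_back(std::move(pending));
            pending.clear();  // reset the moved-from buffer to a known empty state
        }
    }
    if (!pending.empty()) out.push_back(std::move(pending));
    return out;
}

// Split one block into per-partition blocks.
std::map<std::string, Block> partition(const Block& block) {
    std::map<std::string, Block> parts;
    for (const std::string& key : block) parts[key].push_back(key);
    return parts;
}

// Ordering described for the pipeline write: squash the whole input, then partition.
// Returns how many chunks each partition's sink receives.
std::map<std::string, std::size_t> squashThenPartition(const std::vector<Block>& input,
                                                       std::size_t min_rows) {
    std::map<std::string, std::size_t> chunks;
    for (const Block& b : squash(input, min_rows))
        for (const auto& kv : partition(b)) chunks[kv.first] += 1;
    return chunks;
}

// Ordering described for Spark 3.3: partition first, then squash within each partition.
std::map<std::string, std::size_t> partitionThenSquash(const std::vector<Block>& input,
                                                       std::size_t min_rows) {
    std::map<std::string, std::vector<Block>> per_partition;
    for (const Block& b : input)
        for (auto& kv : partition(b)) per_partition[kv.first].push_back(std::move(kv.second));
    std::map<std::string, std::size_t> chunks;
    for (const auto& kv : per_partition) chunks[kv.first] = squash(kv.second, min_rows).size();
    return chunks;
}
```

With four input blocks that each hold one row for partition `a` and one for partition `b`, and `min_rows = 4`, `squashThenPartition` hands each partition two chunks of two rows, while `partitionThenSquash` hands each partition a single chunk of four rows. In this toy model, squashing before partitioning bounds the size of what enters the partitioned sink rather than the size of what each per-partition sink receives.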

How was this patch tested?

Using existing UTs.

github-actions[bot] commented 1 week ago

https://github.com/apache/incubator-gluten/issues/7028

github-actions[bot] commented 1 week ago

Run Gluten Clickhouse CI on x86