[GLUTEN-7028][CH][Part-8] Support one pipeline write for partition mergetree

baibaichen commented 1 week ago

What changes were proposed in this pull request?

(Fixes: #7028) The following diagram shows the current class hierarchy. SparkPartitionedBaseSink inherits from ClickHouse's DB::PartitionedSink.

WriteStatsBase
  |- MergeTreeStats  <--- collect stats at finish  -----------------------|
  |- WriteStats      <--- collect stats at consume ---|                   |
                                                      |                   |
SparkPartitionedBaseSink                              |                   |
  |- SubstraitPartitionedFileSink      ---create --> SubstraitFileSink    |
  |- SparkMergeTreePartitionedFileSink ---create --> SparkMergeTreeSink --|
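
For readers who prefer code to the diagram, here is a minimal, self-contained sketch of the same relationships. It is not the Gluten or ClickHouse code: every type ending in `Sketch`, the `Chunk` struct, and the `createSink` hook are hypothetical simplifications. It only illustrates that a partitioned sink creates one child sink per partition, and that the file-style path reports stats at consume while the MergeTree-style path reports them at finish.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a block of rows that all belong to one partition.
struct Chunk {
    std::string partition;
    std::size_t rows;
};

// Simplified stand-in for WriteStatsBase: rows written per partition.
struct StatsSketch {
    std::map<std::string, std::size_t> rows_written;
    void collect(const std::string & partition, std::size_t rows) { rows_written[partition] += rows; }
};

struct SinkSketch {
    virtual ~SinkSketch() = default;
    virtual void consume(const Chunk & chunk) = 0;
    virtual void finish() = 0;
};

// File-style child sink: reports on every consume (the WriteStats arrow in the diagram).
struct FileSinkSketch : SinkSketch {
    StatsSketch & stats;
    explicit FileSinkSketch(StatsSketch & s) : stats(s) {}
    void consume(const Chunk & chunk) override { stats.collect(chunk.partition, chunk.rows); }
    void finish() override {}
};

// MergeTree-style child sink: buffers and reports only at finish (the MergeTreeStats arrow).
struct MergeTreeSinkSketch : SinkSketch {
    StatsSketch & stats;
    std::string partition;
    std::size_t pending_rows = 0;
    MergeTreeSinkSketch(StatsSketch & s, std::string p) : stats(s), partition(std::move(p)) {}
    void consume(const Chunk & chunk) override {
        assert(chunk.partition == partition); // a child sink serves exactly one partition
        pending_rows += chunk.rows;
    }
    void finish() override { stats.collect(partition, pending_rows); }
};

// Partitioned base: routes each chunk to a per-partition child sink created on demand.
class PartitionedSinkSketch {
public:
    virtual ~PartitionedSinkSketch() = default;
    void consume(const Chunk & chunk) {
        auto & sink = sinks[chunk.partition];
        if (!sink)
            sink = createSink(chunk.partition);
        sink->consume(chunk);
    }
    void finish() {
        for (auto & entry : sinks)
            entry.second->finish();
    }
protected:
    virtual std::unique_ptr<SinkSketch> createSink(const std::string & partition) = 0;
private:
    std::map<std::string, std::unique_ptr<SinkSketch>> sinks;
};

struct PartitionedFileSinkSketch : PartitionedSinkSketch {
    StatsSketch & stats;
    explicit PartitionedFileSinkSketch(StatsSketch & s) : stats(s) {}
protected:
    std::unique_ptr<SinkSketch> createSink(const std::string &) override {
        return std::make_unique<FileSinkSketch>(stats);
    }
};

struct PartitionedMergeTreeSinkSketch : PartitionedSinkSketch {
    StatsSketch & stats;
    explicit PartitionedMergeTreeSinkSketch(StatsSketch & s) : stats(s) {}
protected:
    std::unique_ptr<SinkSketch> createSink(const std::string & partition) override {
        return std::make_unique<MergeTreeSinkSketch>(stats, partition);
    }
};
```

In this sketch the only difference between the two partitioned sinks is which child they create; the routing logic lives once in the base class.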

With pipeline write, the partitioned MergeTree pipeline looks like this. It squashes blocks over the whole input before partitioning:

  // spark 3.5
  Input pipeline 
    => PlanSquashingTransform 
      => ApplySquashingTransform 
       => SparkMergeTreePartitionedFileSink
          => SparkMergeTreeSink
          => SparkMergeTreeSink
          => ...
        => MergeTreeStats

This differs from Spark 3.3, which squashes blocks after partitioning, separately for each partition, because there the partitioning is triggered by the JVM.
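
The difference between the two orderings can be illustrated with a small, self-contained sketch. It is not the actual transform code: `Row`, `Block`, `squash`, `partitionInto`, and the two driver functions are hypothetical, and the squashing rule here uses a row-count threshold only, which is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Row {
    std::string partition;
    int value;
};
using Block = std::vector<Row>;
using PartitionedBlocks = std::map<std::string, std::vector<Block>>;

// Merge consecutive blocks until at least min_rows rows are buffered, then emit.
// Any remainder is flushed as a final, possibly smaller, block.
std::vector<Block> squash(const std::vector<Block> & input, std::size_t min_rows) {
    assert(min_rows > 0);
    std::vector<Block> out;
    Block pending;
    for (const Block & block : input) {
        pending.insert(pending.end(), block.begin(), block.end());
        if (pending.size() >= min_rows) {
            out.push_back(pending);
            pending.clear();
        }
    }
    if (!pending.empty())
        out.push_back(pending);
    return out;
}

// Split one block by partition key and append each piece to that partition's list.
void partitionInto(const Block & block, PartitionedBlocks & result) {
    std::map<std::string, Block> pieces;
    for (const Row & row : block)
        pieces[row.partition].push_back(row);
    for (const auto & piece : pieces)
        result[piece.first].push_back(piece.second);
}

// Ordering described for the Spark 3.5 pipeline: squash the whole input, then partition.
PartitionedBlocks squashThenPartition(const std::vector<Block> & input, std::size_t min_rows) {
    PartitionedBlocks result;
    for (const Block & block : squash(input, min_rows))
        partitionInto(block, result);
    return result;
}

// Ordering described for Spark 3.3: partition first, then squash each partition separately.
PartitionedBlocks partitionThenSquash(const std::vector<Block> & input, std::size_t min_rows) {
    PartitionedBlocks split;
    for (const Block & block : input)
        partitionInto(block, split);
    PartitionedBlocks result;
    for (const auto & entry : split)
        result[entry.first] = squash(entry.second, min_rows);
    return result;
}
```

With four input blocks that each hold one row for partition `a` and one for partition `b`, and `min_rows = 4`, `squashThenPartition` hands partition `a` two blocks of two rows, while `partitionThenSquash` hands it a single block of four rows. Both orderings deliver the same rows to each partition; only the block boundaries differ.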

The new implementation behaves the same as ClickHouse.

How was this patch tested?

Using existing UTs.

github-actions[bot] commented 1 week ago

https://github.com/apache/incubator-gluten/issues/7028

github-actions[bot] commented 1 week ago

Run Gluten Clickhouse CI on x86

[GLUTEN-7028][CH][Part-8] Support one pipeline write for partition mergetree #7924