apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[GLUTEN-7028][CH][Part-9] Collecting Delta stats for parquet #7993

Closed baibaichen closed 19 hours ago

baibaichen commented 2 days ago

What changes were proposed in this pull request?

Introduces DeltaStats, which collects statistics for written Parquet files the same way Delta does.

(Fixes: #7028)
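
For context, Delta Lake stores per-file statistics (`numRecords`, `minValues`, `maxValues`, `nullCount`) in the `stats` field of each `add` action in its transaction log. The sketch below shows that kind of aggregation over in-memory rows. It is only an illustration, assuming long-valued columns: `FileStats`, `FileStatsSketch`, and `collectStats` are hypothetical names and are not the DeltaStats implementation from this PR.

```scala
// Illustrative only: FileStats and collectStats are hypothetical names,
// not the DeltaStats class introduced by this PR.
final case class FileStats(
    numRecords: Long,
    minValues: Map[String, Long],
    maxValues: Map[String, Long],
    nullCount: Map[String, Long])

object FileStatsSketch {
  // Each row maps a column name to an optional value; None models SQL NULL.
  type Row = Map[String, Option[Long]]

  // Single pass over the rows, tracking count, min, max, and null count
  // for every requested column.
  def collectStats(columns: Seq[String], rows: Seq[Row]): FileStats = {
    val empty = FileStats(0L, Map.empty, Map.empty, columns.map(_ -> 0L).toMap)
    rows.foldLeft(empty) { (acc, row) =>
      columns.foldLeft(acc.copy(numRecords = acc.numRecords + 1)) { (s, c) =>
        row.getOrElse(c, None) match {
          case Some(v) =>
            // Non-null value: widen the min/max range for this column.
            s.copy(
              minValues = s.minValues.updated(c, s.minValues.get(c).fold(v)(m => math.min(m, v))),
              maxValues = s.maxValues.updated(c, s.maxValues.get(c).fold(v)(m => math.max(m, v))))
          case None =>
            // Null (or missing) value: only the null counter changes.
            s.copy(nullCount = s.nullCount.updated(c, s.nullCount(c) + 1))
        }
      }
    }
  }
}
```

Calling `FileStatsSketch.collectStats(Seq("l_orderkey"), rows)` on the rows of one written file would yield the values that end up serialized next to that file's `add` entry. A column containing only nulls has a `nullCount` entry but no min or max.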

How was this patch tested?

Using existing UTs.

In test("test parquet table write with the delta"), added logic to verify the Delta stats:

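    if (spark35) {
      // Write the same data with the native writer disabled, so that
      // vanilla Spark/Delta produces the reference stats.
      val vanillaTable = "lineitem_delta_parquet_vanilla"
      withSQLConf((GlutenConfig.NATIVE_WRITER_ENABLED.key, "false")) {
        doInsert(drop(vanillaTable), createLineitem(vanillaTable), insert(vanillaTable))
      }
      // Expected stats: read from commit 1 of the vanilla table's Delta log.
      val expected = DeltaStatsUtils
        .statsDF(
          spark,
          s"$basePath/$vanillaTable/_delta_log/00000000000000000001.json",
          q1SchemaString)
        .collect()

      // The stats in commit 1 of the table under test must match the vanilla ones.
      checkAnswer(
        DeltaStatsUtils.statsDF(
          spark,
          s"$basePath/$table/_delta_log/00000000000000000001.json",
          q1SchemaString),
        expected
      )
    }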

github-actions[bot] commented 2 days ago

https://github.com/apache/incubator-gluten/issues/7028

github-actions[bot] commented 2 days ago

Run Gluten Clickhouse CI on x86
