delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.49k stars 1.68k forks source link

[BUG][Spark] Disabling DELTA_STATS_SKIPPING fails requirement when reading tables with DVs #2443

Open andreaschat-db opened 8 months ago

andreaschat-db commented 8 months ago

Bug

Which Delta project/connector is this regarding?

Describe the problem

Disabling DELTA_STATS_SKIPPING causes operations with DVs to fail with: Cannot work with a non-pinned table snapshot of the TahoeFileIndex when reading a table with DVs. The issue is caused because when we disable DELTA_STATS_SKIPPING the plan contains a TahoeLogFileIndex. This causes the requirement to fail at PreprocessTableWithDVs.dvEnabledScanFor to fail.

Steps to reproduce

The issue can be reproduced with:

  test("TEST non-pinned table bug") {
    withTable("delta_target") {
      withSQLConf(DeltaSQLConf.DELTA_STATS_SKIPPING.key -> false.toString) {
        sql(
          s"""
             |CREATE TABLE delta_target(key2 INT, value INT)
             |USING delta
             |OPTIONS('path'='$tempPath')
        """.stripMargin)

        sql("DELETE FROM delta_target WHERE key2 = 0")
        sql("SELECT key2, value FROM delta_target").collect()
      }
    }
  }

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

vkorukanti commented 8 months ago

The reason for the assert is because TahoeLogFileIndex is unstable that is not yet pinned to a particular version of the table snapshot. According to @ryan-johnson-databricks , if it hasisTimeTravel=true, then it is stable. There is a special case in PrepareDeltaScan when data skipping is disabled, it generates TahoeLogFileIndex with isTimeTravel=fasle.

More on this assert: We have a rule PreprocessTablesWithDVs that alters the query scan with an additional (filePath to DV) broadcast map for reading DVs. This rules immediately after the PrepareDeltaScan rule here. In PreprocessTablesWithDVs, we enforce a requirement that the scan shouldn’t have TahoeLogFileIndex as it is not stable as it is not yet pinned. However when data skipping is disabled, we keep the TahoeLogFileIndex with isTimeTravel=false. This fails the assert in PreprocessTablesWithDVs.