delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[BUG][Spark] delta-spark allows reading column mapping when missing from table features #3890

Open zachschuermann opened 2 days ago


Bug

Which Delta project/connector is this regarding?

Describe the problem

TL;DR: it is relatively easy to create a table that (according to the protocol) shouldn't allow column mapping, yet is read with column mapping in delta-spark.

I think there are two pieces to this issue:

  1. [bug] delta-spark uses column mapping to read a table that does not list column mapping in its reader table features.
  2. [api sharp edge?] Delta's upgradeTableProtocol will upgrade from reader version 2 to reader version 3 without adding any table features. This is a problem because it silently turns off column mapping: column mapping is enabled/supported at reader version 2, but at reader version 3 the table feature must be present.
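
For reference, here is a minimal sketch of the protocol rule as I read it. `column_mapping_readable` is a hypothetical helper, not a delta-spark API, and it assumes the `protocol` action is available as a plain dict:

```python
def column_mapping_readable(protocol: dict) -> bool:
    """Return True if a reader may apply column mapping under this protocol action.

    Hypothetical helper for illustration only; not part of delta-spark.
    """
    reader_version = protocol["minReaderVersion"]
    if reader_version >= 3:
        # Table-features mode: the feature must be listed explicitly.
        return "columnMapping" in (protocol.get("readerFeatures") or [])
    # Legacy reader versions: column mapping is supported from version 2.
    return reader_version >= 2


# Protocol before and after upgradeTableProtocol(3, 7), as described in this issue.
before = {
    "minReaderVersion": 2,
    "minWriterVersion": 7,
    "writerFeatures": ["columnMapping", "icebergCompatV1"],
}
after = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "writerFeatures": ["columnMapping", "icebergCompatV1"],
}

print(column_mapping_readable(before))  # True
print(column_mapping_readable(after))   # False
```

Under this reading, the upgraded table should no longer be read with column mapping.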

Steps to reproduce

See example below for code implementing these steps:

  1. The table is created with reader version 2 and writer version 7, with "writerFeatures":["columnMapping","icebergCompatV1"] and delta.columnMapping.mode = name.
  2. upgradeTableProtocol(3, 7) then gives reader version 3 with no reader features. This effectively turns off column mapping.
  3. When the table is read, it looks like it is read with columnMapping = name.
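```python
# using pyspark
from pathlib import Path

from delta.tables import DeltaTable

# get_sample_data and case are helpers from my test harness (not shown here).
df = get_sample_data(spark)
delta_path = str(Path(case.delta_root).absolute())

# table at version 0: reader version 2, writer version 7
delta_table: DeltaTable = (
    DeltaTable.create(spark)
    .location(delta_path)
    .addColumns(df.schema)
    .property("delta.enableIcebergCompatV1", "true")
    .execute()
)

# upgrade to reader version 3; no reader features are added
delta_table.upgradeTableProtocol(3, 7)

# append data to the upgraded table
df.repartition(1).write.format("delta").mode("append").save(case.delta_root)
```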
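
To confirm the protocol state after these steps without going through Spark, the JSON commits in `_delta_log` can be inspected directly. The following is only a sketch: `latest_protocol` is a hypothetical helper, it assumes the table lives on a local filesystem, and it ignores checkpoints, so it is only meant for small tables like this repro.

```python
import json
from pathlib import Path
from typing import Optional


def latest_protocol(table_path: str) -> Optional[dict]:
    """Return the most recent `protocol` action found in the JSON commit files.

    Hypothetical helper for illustration; it does not read checkpoints.
    """
    log_dir = Path(table_path) / "_delta_log"
    protocol = None
    # Commit files are named with a zero-padded version number,
    # so sorting by name visits them in version order.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "protocol" in action:
                protocol = action["protocol"]
    return protocol
```

Calling `latest_protocol(delta_path)` after the upgrade should show `minReaderVersion` 3 with no `columnMapping` entry under `readerFeatures`, if the steps above behave as described.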

Observed results

The table is read with column mapping.

Expected results

The table should not be read with column mapping.
