When using time travel to retrieve a previous version of a table via a snapshot ID, the table's current schema is used instead of the snapshot's schema, contrary to the documentation.
Reproduction code:
# Create the table
spark_session.sql("CREATE TABLE iceberg_test (id bigint, data string, col float)")
# Populate the table
spark_session.sql("INSERT INTO iceberg_test VALUES (1, 'a', 1.0), (2, 'b', 2.0), (3, 'c', 3.0)")
# Rename 'col' to 'value'
spark_session.sql("ALTER TABLE iceberg_test RENAME COLUMN col TO value")
# Insert a new row
spark_session.sql("INSERT INTO iceberg_test VALUES (4, 'd', 4.0)")
# Time-travel to the first snapshot_id listed in iceberg_test.snapshots
snapshot_1 = spark_session.sql("SELECT * FROM iceberg_test VERSION AS OF <INSERT SNAPSHOT ID>")
# Filter on the renamed field
snapshot_1.filter("col == 2.0").show()
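The placeholder snapshot ID above can be looked up in the snapshots metadata table, e.g. (a sketch, run in the same Spark session as the reproduction code):

```python
# List snapshots oldest-first; the first row's snapshot_id predates the rename
spark_session.sql(
    "SELECT committed_at, snapshot_id FROM iceberg_test.snapshots ORDER BY committed_at"
).show()
```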
We end up with the following error:
Py4JJavaError: An error occurred while calling o111.showString.
: org.apache.iceberg.exceptions.ValidationException: Cannot find field 'col' in struct: struct<1: id: optional long, 2: data: optional string, 3: value: optional float>
NOTES:
snapshot_1.printSchema() confirms that the field name is col and not value, as per the queried snapshot
The error also occurs when using the Spark DataFrame API
Willingness to contribute
[ ] I can contribute a fix for this bug independently
[ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
[X] I cannot contribute a fix for this bug at this time
Apache Iceberg version
1.5.0
Query engine
Spark