When using time travel to retrieve a previous version of a table via a snapshot ID, the table's current schema is used instead of the snapshot's schema, contrary to the documentation.
Reproduction code:
# Create the table
spark_session.sql("CREATE TABLE iceberg_test (id bigint, data string, col float)")
# Populate the table
spark_session.sql("INSERT INTO iceberg_test VALUES (1, 'a', 1.0), (2, 'b', 2.0), (3, 'c', 3.0)")
# Rename 'col' to 'value'
spark_session.sql("ALTER TABLE iceberg_test RENAME COLUMN col TO value")
# Insert a new row
spark_session.sql("INSERT INTO iceberg_test VALUES (4, 'd', 4.0)")
# Time-travel to the first snapshot_id listed in iceberg_test.snapshots
snapshot_1 = spark_session.sql("SELECT * FROM iceberg_test VERSION AS OF <INSERT SNAPSHOT ID>")
# Filter on the renamed field
snapshot_1.filter("col == 2.0").show()
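The placeholder snapshot ID above can be looked up in the snapshots metadata table, e.g. (a sketch, run in the same Spark session as the reproduction code):

```python
# List snapshots oldest-first; the first row's snapshot_id predates the rename
spark_session.sql(
    "SELECT committed_at, snapshot_id FROM iceberg_test.snapshots ORDER BY committed_at"
).show()
```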
We end up with the following error:
Py4JJavaError: An error occurred while calling o111.showString.
: org.apache.iceberg.exceptions.ValidationException: Cannot find field 'col' in struct: struct<1: id: optional long, 2: data: optional string, 3: value: optional float>
NOTES:
snapshot_1.printSchema() confirms that the field name is col and not value, as per the queried snapshot
The error also occurs when using the Spark DataFrame API
Willingness to contribute
[ ] I can contribute a fix for this bug independently
[ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
[X] I cannot contribute a fix for this bug at this time
Apache Iceberg version
1.5.0
Query engine
Spark