
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Snapshot chain getting broken - data incorrectly removed #11243

Open cristian-fatu opened 1 day ago

cristian-fatu commented 1 day ago

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

We're running Iceberg with Spark, using Spark Structured Streaming to read from a Kafka topic and write to an Iceberg table. Recently we also started running a batch Spark job to backfill some older data into the same table. The streaming job and the backfill job run at the same time, inserting into the table concurrently (inserts only, no merges or deletes). The streaming job runs 4-minute micro-batches, while the backfill job is run on demand and can insert several times per minute.
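
For illustration, the two write paths look roughly like this; all catalog, table, topic, and path names below are placeholders, not our actual configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-concurrent-writers").getOrCreate()

# Streaming job: Kafka -> Iceberg, append-only, ~4-minute micro-batches.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .load()
    # (parsing the Kafka value into the table schema is omitted here)
)

streaming_query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="4 minutes")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("glue_catalog.db.events")
)

# Backfill job: runs as a separate Spark application at the same time and does
# plain batch appends into the same table, potentially several times per minute.
backfill = spark.read.parquet("s3://bucket/backfill/")
backfill.writeTo("glue_catalog.db.events").append()
```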

We are seeing that, on occasion, a new snapshot gets created that has no parent snapshot ID. When that happens, the data loaded in previous snapshots effectively becomes "invisible" when querying the table. Also, when we later expire snapshots and delete orphaned files, the older data is hard-deleted (which sounds like the correct behavior, given that those snapshots are no longer ancestors of the current one).
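
For context, the snapshot expiration and orphan-file cleanup we refer to is along the lines of the standard Iceberg Spark procedures below (a sketch only; the catalog, table name, and timestamp are placeholders):

```python
# Once the parent link is lost, the older snapshots are no longer ancestors of the
# current snapshot, so expiration drops them and deletes any data files that are no
# longer referenced by a remaining snapshot.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-09-30 16:36:00.000'
    )
""")

spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(table => 'db.events')
""")
```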

It sounds like the problem is caused by concurrently updating the table, but we haven't managed to reproduce it on demand.

As an additional symptom, we noticed the following when looking at the table's metadata:

| made_current_at | snapshot_id | parent_id | is_current_ancestor |
|---|---|---|---|
| 2024-09-30 19:36:23.975000 | 4112309507491600842 | 4459680660798272782 | true |
| 2024-09-30 19:36:21.466000 | 4459680660798272782 | (null) | true |
| 2024-09-30 19:36:19.863000 | 6444149358610591875 | 8676544541428494413 | false |
| 2024-09-30 19:36:18.948000 | 8676544541428494413 | 3452861481380993540 | false |
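
These columns come from the table's history metadata table; a query along these lines (catalog and table name are placeholders) produces the view above:

```python
spark.sql("""
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM glue_catalog.db.events.history
    ORDER BY made_current_at DESC
""").show(truncate=False)
```

Note the second row: snapshot 4459680660798272782 has no parent_id, so the two older snapshots below it are no longer current ancestors even though they were committed only seconds earlier.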

Other relevant info:

The target table has the following properties:

- history.expire.max-snapshot-age-ms = 10800000
- write.metadata.previous-versions-max = 20
- write.parquet.compression-codec = zstd
- write.spark.accept-any-schema = true
- write.metadata.delete-after-commit.enabled = true
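
For completeness, these correspond to table properties set roughly like this (placeholder catalog and table name):

```python
spark.sql("""
    ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '10800000',
        'write.metadata.previous-versions-max' = '20',
        'write.parquet.compression-codec' = 'zstd',
        'write.spark.accept-any-schema' = 'true',
        'write.metadata.delete-after-commit.enabled' = 'true'
    )
""")
```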

Willingness to contribute

cristian-fatu commented 1 day ago

It also looks like whenever this issue happens we get this in the Spark logs:

Retrying task after failure: Cannot commit REDACTED because base metadata location 's3://REDACTED/metadata/80740-e132a8c4-6481-441d-a1d8-5655699a61c4.metadata.json' is not same as the current Glue location 's3://REDACTED/metadata/80741-b80a0885-0864-46c1-a7b9-34819a27ff9f.metadata.json'

However, I checked and this error also appears at other times, when no data goes missing.

cristian-fatu commented 2 hours ago

And just to clarify: we are only doing inserts/appends (no updates or deletes), and the streaming and batch jobs write to different partitions of the table.