
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Snapshot chain getting broken - data incorrectly removed #11243

Open cristian-fatu opened 1 day ago

cristian-fatu commented 1 day ago

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

We're running Iceberg with Spark, using Spark Structured Streaming to read from a Kafka topic and write to an Iceberg table. Recently we also started running a batch Spark job to backfill some older data into the same table. The streaming job and the backfill job run at the same time, inserting into the table concurrently (inserts only, no merges or deletes). The streaming job runs 4-minute micro-batches, while the backfill job is run on demand and can insert several times per minute.
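
For illustration, the two write paths look roughly like this; all catalog, table, topic, and path names below are placeholders, not our actual configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-concurrent-writers").getOrCreate()

# Streaming job: Kafka -> Iceberg, append-only, ~4-minute micro-batches.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events-topic")
    .load()
    # (parsing the Kafka value into the table schema is omitted here)
)

streaming_query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="4 minutes")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .toTable("glue_catalog.db.events")
)

# Backfill job: runs as a separate Spark application at the same time and does
# plain batch appends into the same table, potentially several times per minute.
backfill = spark.read.parquet("s3://bucket/backfill/")
backfill.writeTo("glue_catalog.db.events").append()
```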

We are seeing that, on occasion, a new snapshot gets created that has no parent snapshot ID. When that happens, the data loaded in previous snapshots effectively becomes "invisible" when querying the table. Also, when we later expire snapshots and delete orphaned files, the older data is hard-deleted (which sounds like the correct behavior, given that those snapshots are no longer ancestors of the current one).
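
For context, the snapshot expiration and orphan-file cleanup we refer to is along the lines of the standard Iceberg Spark procedures below (a sketch only; the catalog, table name, and timestamp are placeholders):

```python
# Once the parent link is lost, the older snapshots are no longer ancestors of the
# current snapshot, so expiration drops them and deletes any data files that are no
# longer referenced by a remaining snapshot.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-09-30 16:36:00.000'
    )
""")

spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(table => 'db.events')
""")
```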

It sounds like the problem is caused by concurrently updating the table, but we haven't managed to reproduce it on demand.

As an additional symptom, we noticed the following when looking at the table's metadata:

| made_current_at | snapshot_id | parent_id | is_current_ancestor |
|---|---|---|---|
| 2024-09-30 19:36:23.975000 | 4112309507491600842 | 4459680660798272782 | true |
| 2024-09-30 19:36:21.466000 | 4459680660798272782 | (null) | true |
| 2024-09-30 19:36:19.863000 | 6444149358610591875 | 8676544541428494413 | false |
| 2024-09-30 19:36:18.948000 | 8676544541428494413 | 3452861481380993540 | false |
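
These columns come from the table's history metadata table; a query along these lines (catalog and table name are placeholders) produces the view above:

```python
spark.sql("""
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM glue_catalog.db.events.history
    ORDER BY made_current_at DESC
""").show(truncate=False)
```

Note the second row: snapshot 4459680660798272782 has no parent_id, so the two older snapshots below it are no longer current ancestors even though they were committed only seconds earlier.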

Other relevant info:

The target table has the following properties:

- history.expire.max-snapshot-age-ms = 10800000
- write.metadata.previous-versions-max = 20
- write.parquet.compression-codec = zstd
- write.spark.accept-any-schema = true
- write.metadata.delete-after-commit.enabled = true
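
For completeness, these correspond to table properties set roughly like this (placeholder catalog and table name):

```python
spark.sql("""
    ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '10800000',
        'write.metadata.previous-versions-max' = '20',
        'write.parquet.compression-codec' = 'zstd',
        'write.spark.accept-any-schema' = 'true',
        'write.metadata.delete-after-commit.enabled' = 'true'
    )
""")
```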

Willingness to contribute

cristian-fatu commented 1 day ago

It also looks like whenever this issue happens we get this in the Spark logs:

Retrying task after failure: Cannot commit REDACTED because base metadata location 's3://REDACTED/metadata/80740-e132a8c4-6481-441d-a1d8-5655699a61c4.metadata.json' is not same as the current Glue location 's3://REDACTED/metadata/80741-b80a0885-0864-46c1-a7b9-34819a27ff9f.metadata.json'

However, I checked and this error also appears at other times, when no data goes missing.

cristian-fatu commented 2 hours ago

And just to clarify: we are only doing inserts/appends (no updates or deletes), and the streaming and batch jobs write to different partitions of the table.