apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Does expireSnapshotId delete older snapshots data files? #8410

Open DavidCampanero opened 1 year ago

DavidCampanero commented 1 year ago

Query engine

Spark

Question

I've written a script that keeps the first snapshot of each month plus the snapshots from the last few days. But once I use expireSnapshotId to delete the snapshots older than 7 days that are not the first of their month, it seems I can no longer access the older data, even though the metadata files (JSON and Avro) are still there. The data files that the Avro file references are no longer there.

So I don't know whether "breaking the chain" means I lose the data, and whether I will still be able to time travel to see what the data looked like 1 year ago if I have deleted data in the middle.
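
For readers following along, the retention rule described in the question (keep the first snapshot of each month, keep everything from the last 7 days, expire the rest) can be sketched in plain Python. This is only an illustration of the selection logic, not the original script: the function name `snapshots_to_expire`, the `(snapshot_id, committed_at)` tuple shape, and the `keep_days` parameter are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now, keep_days=7):
    """Return the ids of snapshots that the retention rule would expire.

    `snapshots` is an iterable of (snapshot_id, committed_at) pairs with
    timezone-aware datetimes. This shape is hypothetical; in practice the
    pairs would come from the table's snapshot metadata.
    """
    cutoff = now - timedelta(days=keep_days)
    ordered = sorted(snapshots, key=lambda s: s[1])

    # Remember the earliest snapshot seen in each (year, month).
    first_of_month = {}
    for snapshot_id, committed_at in ordered:
        key = (committed_at.year, committed_at.month)
        first_of_month.setdefault(key, snapshot_id)
    keep_monthly = set(first_of_month.values())

    # Expire only snapshots that are older than the cutoff
    # and are not the first snapshot of their month.
    return [
        snapshot_id
        for snapshot_id, committed_at in ordered
        if committed_at < cutoff and snapshot_id not in keep_monthly
    ]


if __name__ == "__main__":
    utc = timezone.utc
    example = [
        (1, datetime(2023, 1, 1, tzinfo=utc)),
        (2, datetime(2023, 1, 15, tzinfo=utc)),
        (3, datetime(2023, 2, 2, tzinfo=utc)),
        (4, datetime(2023, 2, 20, tzinfo=utc)),
        (5, datetime(2023, 3, 9, tzinfo=utc)),
        (6, datetime(2023, 3, 10, tzinfo=utc)),
    ]
    print(snapshots_to_expire(example, now=datetime(2023, 3, 12, tzinfo=utc)))
```

Each returned id would then be the argument to one expireSnapshotId call, as in the question.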

RussellSpitzer commented 1 year ago

The Spark action for this computes the set of files reachable before the snapshots are expired and the set reachable afterwards, and deletes the difference, i.e. the files that no remaining snapshot references. So the Spark action would preserve data files that the retained snapshots still need. For the pure Java version, the implementation is much more complicated; the intent is the same, although it may not be correct in this use case.
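
The "reachable before minus reachable after" idea can be illustrated with a toy model in plain Python. This is a sketch of the set difference only, not Iceberg's actual implementation: the function `files_to_delete`, the dictionary layout, and the file names are hypothetical.

```python
def files_to_delete(snapshot_files, expired_ids):
    """Return files reachable before expiration but not after it.

    `snapshot_files` maps snapshot_id -> set of file paths reachable from
    that snapshot (a simplified, hypothetical stand-in for real metadata).
    """
    expired = set(expired_ids)

    # Everything any snapshot can reach before expiring.
    before = set().union(*snapshot_files.values())

    # Everything the retained snapshots can still reach afterwards.
    retained = [
        files for snapshot_id, files in snapshot_files.items()
        if snapshot_id not in expired
    ]
    after = set().union(*retained)

    return before - after


if __name__ == "__main__":
    table = {
        1: {"a.parquet"},
        2: {"a.parquet", "b.parquet"},
        3: {"a.parquet", "c.parquet"},
    }
    # Only b.parquet is unreachable once snapshot 2 is gone.
    print(files_to_delete(table, expired_ids=[2]))
```

Under this model, a file is removed only when no retained snapshot references it, so expiring a snapshot in the middle of the history does not by itself remove files that an older retained snapshot still needs.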

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.