apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Extend Snapshot Metadata Lifecycle #10646

Open szehon-ho opened 4 weeks ago

szehon-ho commented 4 weeks ago

Proposed Change

Motivation

Currently, a snapshot's lifecycle is handled by 'ExpireSnapshots(long olderThan)'. This operation does the following:

- Choose a set of Snapshots to expire based on timestamp
- Remove these Snapshots' references from TableMetadata
- Purge the metadata files of these Snapshots
- Purge the data deleted by these Snapshots
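
The four steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Iceberg's implementation: `ToySnapshot`, `ToyTable`, and this `expireOlderThan` method are hypothetical names, files are modeled as plain strings, and the sketch ignores details such as retaining the current snapshot or checking whether a file is still referenced by a retained snapshot.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; none of these types are Iceberg classes.
class ToySnapshot {
    final long id;
    final long timestampMillis;
    final String metadataFile;            // stands in for the snapshot's metadata files
    final List<String> deletedDataFiles;  // data files this snapshot logically removed

    ToySnapshot(long id, long timestampMillis, String metadataFile, List<String> deletedDataFiles) {
        this.id = id;
        this.timestampMillis = timestampMillis;
        this.metadataFile = metadataFile;
        this.deletedDataFiles = deletedDataFiles;
    }
}

class ToyTable {
    final List<ToySnapshot> snapshots = new ArrayList<>(); // references held in table metadata
    final Set<String> storage = new HashSet<>();            // files physically present

    // A single threshold drives all four steps at once.
    void expireOlderThan(long olderThanMillis) {
        // Step 1: choose the snapshots to expire based on timestamp.
        List<ToySnapshot> expired = new ArrayList<>();
        for (ToySnapshot s : snapshots) {
            if (s.timestampMillis < olderThanMillis) {
                expired.add(s);
            }
        }
        // Step 2: remove their references from the table metadata.
        snapshots.removeAll(expired);
        for (ToySnapshot s : expired) {
            // Step 3: purge the snapshot's metadata files.
            storage.remove(s.metadataFile);
            // Step 4: purge the data files the snapshot had deleted.
            storage.removeAll(s.deletedDataFiles);
        }
    }
}

class ExpireSketch {
    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.storage.addAll(List.of("m1", "m2", "d-old"));
        table.snapshots.add(new ToySnapshot(1, 100L, "m1", List.of("d-old")));
        table.snapshots.add(new ToySnapshot(2, 200L, "m2", List.of()));

        table.expireOlderThan(150L);

        System.out.println("references kept: " + table.snapshots.size());
        System.out.println("d-old still stored: " + table.storage.contains("d-old"));
    }
}
```

Because one timestamp controls every step, the reference to snapshot 1 disappears at the same moment its files do, which is the coupling this issue is about.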

Purging the data deleted by expired Snapshots often requires a shorter timeline, due to strict requirements to reclaim unused disk space, fulfill data lifecycle compliance, etc. In many deployments, this means an 'olderThan' timestamp set to just a few days before the current time (the default is 5 days).

On the other hand, removing expired Snapshot references from TableMetadata is ideally done on a more relaxed timeline, such as months or longer, to allow for meaningful historical table analysis.

But today, these are all purged together, so we cannot keep the Snapshot references for longer than the period required for purging the data deleted by expired Snapshots.
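
To make the desired decoupling concrete, here is a minimal sketch that uses two independent thresholds, one for purging deleted data and one for dropping Snapshot references. It is not the design from the proposal document: `Snap`, `TwoTimelineTable`, `purgeDataOlderThan`, and `dropReferencesOlderThan` are all hypothetical names chosen for illustration, and files are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of decoupled retention; not an Iceberg API.
class Snap {
    final long id;
    final long timestampMillis;
    final List<String> deletedDataFiles; // data files this snapshot logically removed
    boolean dataPurged = false;          // true once those files are physically gone

    Snap(long id, long timestampMillis, List<String> deletedDataFiles) {
        this.id = id;
        this.timestampMillis = timestampMillis;
        this.deletedDataFiles = deletedDataFiles;
    }
}

class TwoTimelineTable {
    final List<Snap> history = new ArrayList<>(); // snapshot references kept in metadata
    final Set<String> storage = new HashSet<>();  // files physically present

    void expire(long purgeDataOlderThan, long dropReferencesOlderThan) {
        // Assumption: references must live at least as long as the data they describe.
        if (dropReferencesOlderThan > purgeDataOlderThan) {
            throw new IllegalArgumentException("references cannot be dropped before data is purged");
        }
        // Short timeline: physically remove deleted data to reclaim space.
        for (Snap s : history) {
            if (s.timestampMillis < purgeDataOlderThan && !s.dataPurged) {
                storage.removeAll(s.deletedDataFiles);
                s.dataPurged = true;
            }
        }
        // Long timeline: only now forget the snapshot reference itself.
        history.removeIf(s -> s.timestampMillis < dropReferencesOlderThan);
    }
}

class TwoTimelineSketch {
    public static void main(String[] args) {
        TwoTimelineTable table = new TwoTimelineTable();
        table.storage.addAll(List.of("d1", "d2", "d3"));
        table.history.add(new Snap(1, 100L, List.of("d1")));
        table.history.add(new Snap(2, 200L, List.of("d2")));
        table.history.add(new Snap(3, 300L, List.of("d3")));

        // Purge data for snapshots older than 250, but drop references only below 150.
        table.expire(250L, 150L);

        for (Snap s : table.history) {
            System.out.println("snapshot " + s.id + " kept, dataPurged=" + s.dataPurged);
        }
    }
}
```

In this sketch, snapshot 2 stays visible in the history for analysis even though the data it deleted has already been removed from storage.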

Proposal document

https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit

Specifications

anuragmantri commented 4 weeks ago

Thanks @szehon-ho. +1 to the general idea, I will take a look at the design doc.