Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0
476 stars 59 forks

Delete expired manifests and data files when removing snapshots. #17

rdblue closed 6 years ago

rdblue commented 6 years ago

This adds logic that runs after an expire-snapshots commit to delete stale manifest and data files. It also adds a delete function to the ExpireSnapshots API so that the caller can supply alternative delete logic.

For each snapshot that is removed, any data files that were marked deleted in that snapshot will be deleted. Because manifests can be reused, a manifest is only deleted if it is not referenced by any of the snapshots that are not yet expired.
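The rule above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: `ExpireCleanupSketch`, `SnapshotInfo`, and `cleanUp` are hypothetical names, not classes from this PR, and snapshots are modeled as plain sets of file paths. The `Consumer<String>` parameter stands in for the caller-supplied delete function.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical, simplified model; these are NOT Iceberg's real classes.
final class ExpireCleanupSketch {

  /** A snapshot: the manifests it references and the data files it marked deleted. */
  static final class SnapshotInfo {
    final long id;
    final Set<String> manifests;
    final Set<String> deletedDataFiles;

    SnapshotInfo(long id, Collection<String> manifests, Collection<String> deletedDataFiles) {
      this.id = id;
      this.manifests = new LinkedHashSet<>(manifests);
      this.deletedDataFiles = new LinkedHashSet<>(deletedDataFiles);
    }
  }

  /**
   * Passes every path that is safe to remove to deleteFunc:
   * all data files marked deleted in an expired snapshot, plus any manifest
   * of an expired snapshot that no retained snapshot still references.
   */
  static void cleanUp(List<SnapshotInfo> expired,
                      List<SnapshotInfo> retained,
                      Consumer<String> deleteFunc) {
    // Manifests can be reused, so collect everything retained snapshots still need.
    Set<String> liveManifests = new HashSet<>();
    for (SnapshotInfo snapshot : retained) {
      liveManifests.addAll(snapshot.manifests);
    }

    // A set avoids deleting the same path twice when expired snapshots share files.
    Set<String> toDelete = new LinkedHashSet<>();
    for (SnapshotInfo snapshot : expired) {
      toDelete.addAll(snapshot.deletedDataFiles);
      for (String manifest : snapshot.manifests) {
        if (!liveManifests.contains(manifest)) {
          toDelete.add(manifest);
        }
      }
    }

    toDelete.forEach(deleteFunc);
  }
}
```

A caller could pass a function that removes the path from storage, or one that only records the path (for example `paths::add` on a list) to get a dry run.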

This could delete files that are deleted in one snapshot and added back in a later snapshot. The cleanup could instead check whether each file is still referenced by the current snapshot, but that check is expensive and probably not worth it. Callers should not delete files and re-add them without changing the file name or expiring the snapshot where the file was removed.
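To make the trade-off concrete, here is a small sketch of the two behaviors. It is not code from this PR: `ReAddedFileCheckSketch` and both method names are hypothetical, and the current snapshot is assumed to be available as a flat collection of data file paths. Building that collection means reading every manifest of the current snapshot, which is the cost referred to above.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of the Iceberg API.
final class ReAddedFileCheckSketch {

  /** No check: every file marked deleted in an expired snapshot is removed. */
  static Set<String> filesToDeleteWithoutCheck(Collection<String> deletedInExpired) {
    return new HashSet<>(deletedInExpired);
  }

  /**
   * Alternative with the extra check: skip any file that the current snapshot
   * still references, at the cost of listing all of its data files.
   */
  static Set<String> filesToDeleteWithCheck(Collection<String> deletedInExpired,
                                            Collection<String> currentSnapshotFiles) {
    Set<String> result = new HashSet<>(deletedInExpired);
    result.removeAll(currentSnapshotFiles);
    return result;
  }
}
```

If `a.parquet` was removed in an expired snapshot and later re-added under the same name, the first method still returns it for deletion, while the second keeps it.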

rdblue commented 6 years ago

@omalley, it would be great if you would review this. It implements manifest and file clean-up when snapshots are deleted.

This makes three choices I'd like your opinion on: