Open keen85 opened 1 year ago
@prakharjain09 - do you think you could help here?
Yes, this seems like a good idea. Instead of deleting every entry prior to the 30 day mark (delta.logRetentionDuration), we could identify the newest checkpoint before the 30 day boundary and delete everything before that.
We'd have to keep any intervening commits as well.
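A rough sketch of that rule, in case it helps the discussion. This is not Delta's actual cleanup code: the function name `safe_cleanup_cutoff` and the plain version/timestamp inputs are made up for illustration, and it assumes that reading a version needs the newest checkpoint at or before it plus every commit after that checkpoint.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Set


def safe_cleanup_cutoff(
    commit_times: Dict[int, datetime],  # version -> commit timestamp (hypothetical input)
    checkpoints: Set[int],              # versions that have a checkpoint
    now: datetime,
    retention: timedelta = timedelta(days=30),  # stands in for delta.logRetentionDuration
) -> Optional[int]:
    """Return the version below which log files may be deleted, or None."""
    boundary = now - retention
    # Checkpoints whose commit is already older than the retention boundary.
    expired = [
        v for v in checkpoints
        if v in commit_times and commit_times[v] < boundary
    ]
    if not expired:
        # No checkpoint before the boundary: deleting anything could break
        # reconstruction of versions that are still within retention.
        return None
    # Keep the newest expired checkpoint and every commit after it,
    # including the intervening commits that are older than the boundary.
    return max(expired)


# One commit per day, version 40 being the newest.
base = datetime(2024, 1, 31)
commit_times = {v: base - timedelta(days=40 - v) for v in range(41)}
now = base + timedelta(hours=12)
cutoff = safe_cleanup_cutoff(commit_times, {0, 7, 20, 30}, now)
print(cutoff)
```

In this example, versions 0 to 10 are older than 30 days, but only versions below 7 would be deleted: the checkpoint at 7 and commits 8 to 10 stay, because they are needed to reconstruct versions 11 to 19.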
Two possible user experience challenges, tho:
Is calculating and writing a checkpoint expensive?
If not, why not calculate and create an additional checkpoint for the oldest version that is kept (the oldest version that is younger than delta.logRetentionDuration), if it does not exist already?
This way time travel will work reliably for logRetentionDuration (and not longer); no need to keep older things.
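To make the suggestion concrete, here is a small sketch of how the version needing that extra checkpoint could be picked. The helper name `checkpoint_needed_at` and its inputs are hypothetical, not an existing Delta API, and actually computing and writing the checkpoint is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Set


def checkpoint_needed_at(
    commit_times: Dict[int, datetime],  # version -> commit timestamp (hypothetical input)
    checkpoints: Set[int],              # versions that already have a checkpoint
    now: datetime,
    retention: timedelta = timedelta(days=30),  # stands in for delta.logRetentionDuration
) -> Optional[int]:
    """Return the oldest retained version if it still lacks a checkpoint."""
    boundary = now - retention
    # Versions younger than the retention boundary are the ones to keep.
    kept = [v for v, ts in commit_times.items() if ts > boundary]
    if not kept:
        return None  # nothing is within retention
    oldest_kept = min(kept)
    # If a checkpoint already exists there, no extra one is required.
    return None if oldest_kept in checkpoints else oldest_kept


# One commit per day, version 40 being the newest.
base = datetime(2024, 1, 31)
commit_times = {v: base - timedelta(days=40 - v) for v in range(41)}
now = base + timedelta(hours=12)
needed = checkpoint_needed_at(commit_times, {0, 7, 20, 30}, now)
print(needed)
```

With a checkpoint at the returned version, everything older than it could be dropped without affecting versions inside the retention window.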
It can actually be quite expensive for large tables (I've seen 2+ minutes), and it also opens up the possibility of racing checkpoint writers if two clients are both trying to clean up at the same time and both try to write that checkpoint. See https://github.com/delta-io/delta/issues/1727#issuecomment-1542870175 for an ongoing discussion about what can go wrong when there are races.
Feature request
Overview
As specified in the documentation, there are scenarios where time travel to a previous version is not possible even though the version is younger than the threshold specified by delta.logRetentionDuration and its data files are present (see the "Note" box in the documentation). I propose making the logic for selecting and deleting log files and checkpoint files smarter, so that versions younger than delta.logRetentionDuration will be readable (as long as the underlying data files were not deleted by VACUUM).
Motivation
The current data retention configuration might suggest to inexperienced users that time traveling to versions younger than the minimum of delta.logRetentionDuration and delta.deletedFileRetentionDuration will always be possible. However, there are edge cases where this is not true and where restoring a prior version via time travel after an accidental operation will not work.
Further details
See discussion on Slack: https://delta-users.slack.com/archives/CJ70UCSHM/p1682613236651509
When a new checkpoint is written, deletion of log files and checkpoint files older than delta.logRetentionDuration is carried out. I propose introducing another condition here, checking if these deletion candidates are required by any version that is younger than delta.logRetentionDuration.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?