delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python
https://delta.io
Apache License 2.0

[Feature Request] `logRetentionDuration`: only delete log files and checkpoint files if not required by a version younger than `logRetentionDuration` #1728

Open keen85 opened 1 year ago

keen85 commented 1 year ago

Feature request

Overview

As specified in the documentation, there are scenarios where time travel to a previous version is not possible even though that version is younger than the threshold specified by delta.logRetentionDuration and its data files are still present (see the "Note" box in the documentation). I propose making the logic for selecting and deleting log files and checkpoint files smarter, so that versions younger than delta.logRetentionDuration remain readable (as long as the underlying data files were not deleted by VACUUM).

Motivation

The current data-retention configuration might suggest to inexperienced users that time travel to versions younger than the minimum of delta.logRetentionDuration and delta.deletedFileRetentionDuration will always be possible. However, there are edge cases where this is not true, and where restoring a prior version after an accidental operation via time travel will fail.
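For reference, both retention settings mentioned here are Delta table properties; the table name below is illustrative, and `interval 30 days` / `interval 7 days` are the documented defaults:

```sql
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
```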

Further details

See discussion on Slack: https://delta-users.slack.com/archives/CJ70UCSHM/p1682613236651509

When a new checkpoint is written, log files and checkpoint files older than delta.logRetentionDuration are deleted. I propose introducing another condition here, checking whether these deletion candidates are required by any version younger than delta.logRetentionDuration.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

scottsand-db commented 1 year ago

@prakharjain09 - do you think you could help here?

prakharjain09 commented 1 year ago

Yes, this seems like a good idea. Instead of deleting every entry prior to the 30-day mark (delta.logRetentionDuration), we could identify the newest checkpoint before the 30-day boundary and delete everything before that.
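A minimal sketch of this selection rule, for illustration only (this is not the actual Delta implementation; the `LogEntry` type and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    version: int        # commit version this file belongs to
    timestamp: float    # file modification time, seconds since epoch
    is_checkpoint: bool

def files_to_delete(entries: list[LogEntry], cutoff_ts: float) -> list[LogEntry]:
    # Find the newest checkpoint older than the retention cutoff.
    # Keeping it, and every later entry (including intervening commits),
    # ensures all versions younger than the cutoff remain replayable.
    anchors = [e.version for e in entries
               if e.is_checkpoint and e.timestamp < cutoff_ts]
    if not anchors:
        return []  # no safe anchor checkpoint -> delete nothing
    anchor = max(anchors)
    return [e for e in entries if e.version < anchor]
```

For example, with checkpoints at versions 0 and 5 and a cutoff falling at version 7, the anchor is version 5, so only versions 0-4 are deleted; version 7 can still be reconstructed from checkpoint 5 plus commits 6 and 7.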

ryan-johnson-databricks commented 1 year ago

We'd have to keep any intervening commits as well.

Two possible user-experience challenges, though:

  1. This could mean we leave some very old metadata files around. For example, a table that commits once per week might only have a checkpoint every ten weeks? That said, the cleanup is still effective -- any table with lots of commits will also have newer checkpoints to work from.
  2. If we start keeping older things around, users could start relying on time travel beyond the log retention/vacuum threshold. For consistency, we might need to artificially restrict time travel so that it matches the table's log retention period (even if we didn't actually get around to deleting the older files yet).
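Point 2 could be enforced with a simple guard at read time. A hedged sketch; the function name and error wording here are made up for illustration:

```python
def assert_within_retention(requested_ts: float, now_ts: float,
                            log_retention_s: float) -> None:
    # Reject time-travel reads older than the table's log retention window,
    # even if the corresponding log files happen to still exist on disk.
    if requested_ts < now_ts - log_retention_s:
        raise ValueError(
            "Cannot time travel beyond logRetentionDuration, even though "
            "older log files may still be present")
```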
keen85 commented 1 year ago

Is calculating and writing a checkpoint expensive? If not, why not calculate and create an additional checkpoint for the oldest version that is kept (the oldest version younger than delta.logRetentionDuration), if one does not exist already? This way time travel would work reliably for logRetentionDuration (and not longer); no need to keep older things around.
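This suggestion amounts to something like the following sketch; the `write_checkpoint` callback and the `(version, timestamp, is_checkpoint)` tuples are hypothetical placeholders, not Delta APIs:

```python
def ensure_boundary_checkpoint(entries, cutoff_ts, write_checkpoint):
    """entries: iterable of (version, timestamp, is_checkpoint) tuples.
    If the oldest version inside the retention window has no checkpoint,
    write one at that version so every older file can be deleted safely."""
    retained = [v for v, ts, _ in entries if ts >= cutoff_ts]
    if not retained:
        return None
    oldest = min(retained)
    has_cp = any(cp and v == oldest for v, _, cp in entries)
    if not has_cp:
        write_checkpoint(oldest)  # hypothetical checkpoint-writing hook
        return oldest
    return None
```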

ryan-johnson-databricks commented 1 year ago

It actually can be quite expensive for large tables (I've seen 2+ minutes), but it also opens up the possibility of racing checkpoint writers: if two clients are both trying to clean up at the same time, both may try to write that checkpoint at the same time. See https://github.com/delta-io/delta/issues/1727#issuecomment-1542870175 for an ongoing discussion about what can go wrong when there are races...