delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Protocol Change Request] Improving Time Travel using In-Commit Timestamps #2532

Open dhruvarya-db opened 10 months ago

dhruvarya-db commented 10 months ago

Feature request

Overview

This feature request is about changing Delta commit timestamps to improve time travel.

Motivation

Delta currently uses the file modification time of each commit file as that commit's timestamp. This timestamp drives time travel queries, log cleanup, and staleness checks. File modification time is not a reliable source, however: it can change when the files are copied or moved to another directory (e.g. for disaster recovery) or when the Delta log is fixed manually. When that happens, time travel on the Delta table currently breaks. Because file timestamps can also be non-monotonic, Delta carries a lot of extra code complexity to handle them heuristically.

Further details

We propose a new Writer feature that will require clients to generate a timestamp just before performing a commit and store it in the commit itself.

Compliant writers will ensure that the timestamp stored in Commit X+1 is always greater than the timestamp stored in Commit X. To guarantee this, the client will need to perform conflict detection on these timestamps.
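As a rough illustration of the monotonicity rule, a writer could pick its timestamp as sketched below. The function name, the millisecond unit, and the "previous timestamp plus one" fallback are assumptions made for this sketch, not details taken from the proposal.

```python
import time


def choose_in_commit_timestamp(prev_commit_ts_ms, now_ms=None):
    """Pick a timestamp (ms) strictly greater than the previous commit's.

    Hypothetical helper: prev_commit_ts_ms is the timestamp stored in
    Commit X, or None if this is the first commit on the table.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # wall clock, in milliseconds
    if prev_commit_ts_ms is None:
        return now_ms
    # If the local clock is behind (or equal to) the previous commit,
    # bump past it so that Commit X+1 > Commit X still holds.
    return max(now_ms, prev_commit_ts_ms + 1)
```

If the writer loses a race and another commit lands first, it would presumably re-run this choice against the winning commit's timestamp before retrying, which is where conflict detection comes in.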

  1. The writer will write this timestamp in the CommitInfo action. Furthermore, the writer will always write CommitInfo as the first action in a commit.
  2. Clients that understand these new timestamps will now read the commit file to get the actual timestamp. These timestamps will now be used for time travel queries and by other operations that use timestamps.
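The two steps above can be sketched as follows. Delta commit files are newline-delimited JSON with one action per line, so a reader that knows CommitInfo comes first only needs to parse the first line. The field name `inCommitTimestamp` and both helper functions are assumptions for illustration; the linked proposal defines the actual schema.

```python
import json


def build_commit(in_commit_ts_ms, actions):
    """Serialize a commit with CommitInfo as the first action (hypothetical helper)."""
    # "inCommitTimestamp" is an assumed field name for this sketch.
    commit_info = {"commitInfo": {"inCommitTimestamp": in_commit_ts_ms}}
    lines = [json.dumps(commit_info)] + [json.dumps(a) for a in actions]
    return "\n".join(lines) + "\n"


def read_commit_timestamp(commit_text):
    """Read the timestamp from the first action instead of file metadata."""
    first_line = commit_text.split("\n", 1)[0]
    action = json.loads(first_line)
    if "commitInfo" not in action:
        raise ValueError("expected commitInfo as the first action")
    return action["commitInfo"]["inCommitTimestamp"]


commit_text = build_commit(1700000000000, [{"add": {"path": "part-0.parquet"}}])
timestamp_ms = read_commit_timestamp(commit_text)
```

Because the timestamp is stored in the commit's contents, copying or moving the log files leaves it unchanged, unlike the file modification time.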

The detailed proposal and the required protocol changes are sketched out in this doc.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

dhruvarya-db commented 7 months ago

This is being released as a preview feature in 3.2 (https://github.com/delta-io/delta/pull/2962). The feature will be generally available by 4.0.