delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.58k stars, 1.7k forks

[Feature Request] FSCK REPAIR TABLE sql command #3436

Open Sovima opened 3 months ago

Sovima commented 3 months ago

Feature request

Which Delta project/connector is this regarding?

Overview

The problem: A Delta log can end up referencing Parquet files or deletion vector files that no longer exist in the filesystem. One specific sequence of actions that produces this state:

  1. A table is created and several entries are added.
  2. Some files are deleted.
  3. The VACUUM command is run to physically remove the deleted files.
  4. We time travel to a version of the table from before step 2.

In this case, we end up with a table that references a file that has already been vacuumed, so the table becomes unusable.

Right now, we have no way of selecting from tables that have delta log references to missing files. The FSCK command would make the table usable by removing those references.
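To make the detection half of this concrete, here is a minimal sketch of what a dry-run check could look like. It is not the proposed implementation and does not use any Delta Lake API: it assumes a table on a local filesystem, reads only the JSON commit files in `_delta_log` (ignoring checkpoints and deletion vectors), and assumes every `add`/`remove` path is relative to the table root. The names `find_missing_files` and `demo` are hypothetical.

```python
import json
import os
import re
import tempfile

# Commit files are named with a 20-digit, zero-padded version number.
COMMIT_RE = re.compile(r"^\d{20}\.json$")


def find_missing_files(table_path):
    """Replay JSON commits and return active data files that are absent on disk.

    Assumptions (sketch only): no checkpoints, no deletion vectors,
    and all paths in the log are relative to the table root.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    # Zero-padded names mean lexical order equals version order.
    commits = sorted(f for f in os.listdir(log_dir) if COMMIT_RE.match(f))
    active = set()
    for name in commits:
        with open(os.path.join(log_dir, name)) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)  # one action per line
                if "add" in action:
                    active.add(action["add"]["path"])
                elif "remove" in action:
                    active.discard(action["remove"]["path"])
    return sorted(
        p for p in active if not os.path.exists(os.path.join(table_path, p))
    )


def demo():
    """Build a tiny fake table, delete one data file, and report what is missing."""
    with tempfile.TemporaryDirectory() as table:
        os.makedirs(os.path.join(table, "_delta_log"))
        for name in ("part-0.parquet", "part-1.parquet"):
            open(os.path.join(table, name), "wb").close()
        actions = [
            {"add": {"path": "part-0.parquet"}},
            {"add": {"path": "part-1.parquet"}},
        ]
        commit = os.path.join(table, "_delta_log", "%020d.json" % 0)
        with open(commit, "w") as fh:
            fh.write("\n".join(json.dumps(a) for a in actions))
        # Simulate a file that was physically removed while still referenced.
        os.remove(os.path.join(table, "part-1.parquet"))
        return find_missing_files(table)


if __name__ == "__main__":
    print(demo())
```

A repair step would then commit `remove` actions for the reported paths. That part is left out here because it has to go through Delta's transaction protocol rather than editing the log by hand.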

Motivation

This feature will be used to make the table usable again. At the moment, if a Delta Log contains a reference to a file that does not exist in the filesystem, an error is thrown.

Users have previously requested this feature to be added. See https://github.com/delta-io/delta/issues/748#issuecomment-930724418

Further details

Some of the features this command should have:

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

felipepessoto commented 2 months ago

Hi, any thoughts from the maintainers? @vkorukanti , @allisonport-db

felipepessoto commented 12 hours ago

+ @scottsand-db @dennyglee, please, we need your thoughts on this: