delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.45k stars 1.67k forks source link

[Protocol Change Request] Delta Coordinated Commits #2598

Open prakharjain09 opened 7 months ago

prakharjain09 commented 7 months ago

Feature request

Which Delta project/connector is this regarding?

This will be a PROTOCOL change.

Overview

Today’s Delta commit protocol relies on the filesystem to provide commit atomicity. This writer-only table feature request is to allow an external commit-coordinator to decide whether a commit succeeded instead of relying on filesystem atomicity, similar to Iceberg's metastore tables. The filesystem remains the source of truth about the actual content of commits, and (unlike Iceberg metastore tables) filesystem-based readers can still access the table directly. This allows us to deal with various limitations of delta mentioned below in Motivation section.

Motivation

Today’s Delta commit protocol relies on the filesystem to provide commit atomicity, both for durability and mutual exclusion purposes. This is problematic for several reasons:

  1. No reliable way to involve a catalog with the commit 1.1. Impossible to keep the catalog and the table state in sync, and the catalog cannot reject commit attempts it wouldn’t like, because it doesn’t even know about them until after they become durable (and visible to readers). 1.2. No clear path to transactions that could span multiple tables and/or involve catalog updates, because filesystem commits cannot be made conditionally or atomically.
  2. No way to tie commit ownership to a table 2.1. In general, Delta tables have no way to advertise that they are managed by catalog or LogStore X (at endpoint Y) 2.2. Today a table could be configured to use different log stores in different clusters. Each of the LogStore implementations tries to implement put-if-absent on their own. Since these implementations are not aware of each other, this leads to lost commits and table corruption. 2.3. The logStore setting today is cluster-wide, so you can't safely mix tables with different logStores.
  3. No way to do commits over multiple tables atomically 3.1 We need a way where we can commit to multiple tables at once, allowing us to support multi-table transactions.

Further details

The detail proposal and the required protocol changes are sketched out in this doc.

At high level: We propose a new Delta writer table feature, coordinatedCommits. The table feature includes the following capabilities:

  1. Conforming Delta clients will refuse to attempt filesystem-based commits against a table that enables coordinateCommits.
  2. A table that uses coordinated commits can include metadata (dictated by the table’s commit coordinator) that would-be writers can use to correctly perform a commit.
  3. Delta client passes the actions that need to be committed to the commit-store. a. The commit-store writes the actions in a unique commit file. b. The commit-store then does a commit as per its spec.
  4. If a commit fails due to physical conflict (e.g. two racing blind appends), the client can verify no logical conflicts exist, rebase their actions and then reattempt the commit with commit-store with an updated version, after.
  5. Once commit V has been ratified, it should be copied to V.json (= “backfilled”) in order to bound the amount of history the commit coordinator is required to keep for each table.
  6. The commit-store maintains information about all the un-backfilled commits which will be used by other other delta clients to access the most recent snapshot of the table.
  7. At some point after backfill completes, the commit coordinator deletes the internal mapping for that commit.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

tdas commented 7 months ago

Can we remove the checkbox for which project/connector does this apply to? This is a protocol change, it will eventually apply to every one!

felipepessoto commented 6 months ago

@prakharjain09, at "Converting a managed-commit table to filesystem table" it says:

  1. Write the commit which removes the ownership.
    • Either the commit-owner writes the backfilled commit file directly.
    • Or it writes an unbackfilled commit and ensures that it is backfilled reliably. Until the backfill is done, table will be in unusable state:
      • the filesystem based delta clients won't be able to write to such table as they still believe that table has managed-commit enabled.
      • the managed-commit aware delta clients won't be able to write to such table as the commit-owner won't accept any new commits. In such a scenario, they could backfill required commit themselves (preferably using PUT-if-absent) to unblock themselves.

After the un-backfilled commit, a managed commits aware delta reader could contact the commit-owner to get recent commits. In that case, the commit-owner would return the commit which removes the ownership? If so, the Delta client would need to be smart to know the ownership removal information is still un-backfilled and don't try to commit to filesystem.

felipepessoto commented 5 months ago

@prakharjain09, @sumeet-db, could you help validate if this question is relevant?

@prakharjain09, at "Converting a managed-commit table to filesystem table" it says:

  1. Write the commit which removes the ownership.

    • Either the commit-owner writes the backfilled commit file directly.
    • Or it writes an unbackfilled commit and ensures that it is backfilled reliably. Until the backfill is done, table will be in unusable state:

      • the filesystem based delta clients won't be able to write to such table as they still believe that table has managed-commit enabled.
      • the managed-commit aware delta clients won't be able to write to such table as the commit-owner won't accept any new commits. In such a scenario, they could backfill required commit themselves (preferably using PUT-if-absent) to unblock themselves.

After the un-backfilled commit, a managed commits aware delta reader could contact the commit-owner to get recent commits. In that case, the commit-owner would return the commit which removes the ownership? If so, the Delta client would need to be smart to know the ownership removal information is still un-backfilled and don't try to commit to filesystem.

prakharjain09 commented 5 months ago

After the un-backfilled commit, a managed commits aware delta reader could contact the commit-owner to get recent commits. In that case, the commit-owner would return the commit which removes the ownership? If so, the Delta client would need to be smart to know the ownership removal information is still un-backfilled and don't try to commit to filesystem.

@felipepessoto Thats correct. The Delta client would need to be smart enough to know:

This above requirements could be baked in the managed-commit spec. Essentially if the table has managed-commit enabled, then the writer has to do following extra steps if it is not "active" (i.e. table doesn't have an owner):

"managed-commit enabled" => PROTOCOL has managed-commit enabled "managed-commit active" => PROTOCOL has managed-commit enabled and Metadata contains an owner

So when a user wants to downgrade and remove managed-commits completely from the table (for compat reasons etc), it will be a 2 step process:

scovich commented 5 months ago

This feature request is to allow delta tables whose source of truth for the actual commit is an external commit-owner and not the filesystem (s3, abfs etc)

This is confusing to readers. We should clarify to match what the design actually calls for:

This writer-only table feature request is to allow an external commit-owner to decide whether a commit succeeded instead of relying on filesystem atomicity, similar to Iceberg's metastore tables. The filesystem remains the source of truth about the actual content of commits, and (unlike Iceberg metastore tables) filesystem-based readers can still access the table directly.

adriangb commented 4 months ago

I came here from https://github.com/delta-io/delta-rs/discussions/2455#discussion-6564431. After giving the proposal a quick read (apologies if I’m missing something) I fear that it may not address or improve my use case in https://github.com/delta-io/delta-rs/discussions/2455#discussion-6564431. In particular, this proposal does not do much for concurrent writers. In fact, it goes a bit further towards prohibiting the approach I’m exploring than I’ve found in any other spec: “ Multiple clients might try to contact the commit-store for attempting the same commit version. In such a case, the commit-store chooses which request to accept (if any). The commit-store rejects all other requests for that commit version and responds with error - this tells the client that some other writer has succeeded and it should rebase itself.”

Could this proposal allow for commit-manager to hand out non-consecutive commit ids? For the specific case of appending new data to a table (not modifying it, not depending on existing data) this works quite well. I feel like it’s a reasonable use case. Maybe commit-manager can take a flag that says “this commit doesn’t care about existing data and can happen at any time” to adopt this behavior?

prakharjain09 commented 3 months ago

Yes managed commits does not solve usecase mentioned in https://github.com/delta-io/delta-rs/discussions/2455#discussion-6564431 . Rebasing after a physical conflict is needed because things like rowids / inCommitTimstamp and schema evolution do in fact affect even supposedly blind appends.

That said, supporting blind appends better is an interesting use case to think about (in order to avoid quadratic cost in commit retries) -- irrespective of this proposal.

adriangb commented 3 months ago

That said, supporting blind appends better is an interesting use case to think about (in order to avoid quadratic cost in commit retries) -- irrespective of this proposal.

Agreed!

prakharjain09 commented 3 months ago

Renaming "managed commit" to "coordinated commit" to avoid ambiguity with "managed tables". This feature enables coordination with any kind of coordinator - catalogs, databases, key value stores, etc. so is completely independent of catalog-defined table types.