apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

[Enhancement] Support restore/rollback sync during conversion (1/2) #569

Open danielhumanmod opened 3 weeks ago

danielhumanmod commented 3 weeks ago


What is the purpose of the pull request

Previously, if a rollback/restore occurred in the source table, XTable would reflect it as file changes (added or deleted) in the target table. In this PR, we aim to improve this by issuing a rollback command in the target tables, ensuring more consistent histories between the source and target. This approach is also more efficient, as it allows us to restore directly to a specific version/snapshot instead of computing a large diff against the table’s current state.

This is the first part of this enhancement (1/2), focusing primarily on detecting whether a rollback/restore occurred in the source table and verifying if the corresponding commit exists in the target table.

Brief change log

  1. Add a source identifier to each target transaction
    • snapshot ID in Iceberg, version ID in Delta, and instant in Hudi
  2. [Source] Detect a rollback and get the rollbackSnapshot from the source table
  3. [Target] Verify whether the target table contains the corresponding commit, based on the source identifier
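
To illustrate step 2, here is a minimal sketch of one way a rollback could be detected. It is not the code in this PR. It assumes a hypothetical `snapshot_log`: an oldest-first list of the snapshot IDs the source table has pointed at as its current state. The function name and the log shape are illustrative only.

```python
from typing import Optional, Sequence


def detect_rollback_snapshot(snapshot_log: Sequence[str]) -> Optional[str]:
    """Return the snapshot ID the source rolled back to, or None.

    snapshot_log is a hypothetical, oldest-first list of the snapshot IDs
    that were the table's current state over time.
    """
    if len(snapshot_log) < 2:
        # With fewer than two entries nothing can have moved backwards.
        return None
    current = snapshot_log[-1]
    # If the current ID was already current earlier, the table moved back
    # to an existing snapshot instead of committing a new one.
    if current in snapshot_log[:-1]:
        return current
    return None
```

With a log of `["s0", "s1", "s2", "s1"]` the sketch reports `"s1"` as the rollback snapshot. A log that only grows, such as `["s0", "s1"]`, yields `None`.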

Additional Info

Source Identifier

The source identifier represents the mapping between the source and target formats: given a source identifier, we can find the corresponding target commit.
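
The lookup can be sketched as follows. This is a simplified model, not XTable's API: `TargetCommit`, its fields, and `find_target_commit` are hypothetical names, and the sketch assumes each target commit written by a sync carries the source identifier it was produced from.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TargetCommit:
    # Hypothetical model of one commit in the target table.
    target_id: str                    # e.g. a Delta version or Iceberg snapshot ID
    source_identifier: Optional[str]  # None if the commit was not written by a sync


def find_target_commit(
    commits: Sequence[TargetCommit], source_identifier: str
) -> Optional[TargetCommit]:
    """Return the newest target commit tagged with source_identifier, if any.

    commits is assumed to be ordered oldest first.
    """
    for commit in reversed(commits):
        if commit.source_identifier == source_identifier:
            return commit
    return None
```

Scanning newest first means that if the same source commit was ever synced more than once, the most recent target commit is the one returned.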

Fallback scenarios

Fallback will happen when a rollback or restore is detected in the source table, but the corresponding commit is not found in the target table. We will still leverage the rollback information from the source, but this round of sync will be treated as file changes in the target table, following the previous behavior.

Here’s an example:

Iceberg (Source)          Delta (Target)  
┌────────────┐      ┌─────────────────────┐
│ Snapshot 0 │ ◀  ▶ │ Version 0 (Synced)  │  
│ Snapshot 1 │      │                     │  
│ Snapshot 2 │      │                     │  
│ Snapshot 3 │      │                     │  
│ Snapshot 4 │      │                     │  
│ Snapshot 5 │ ◀  ▶ │ Version 1 (Synced)  │
└────────────┘      └─────────────────────┘  

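Here only Snapshots 0 and 5 have a corresponding Delta version, so a rollback to any of Snapshots 1 through 4 cannot be mirrored as a rollback in the target. In this case, we cannot guarantee complete metadata consistency between the source and target, but using the rollback information from the source still helps reduce some computation.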
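
The fallback decision can be sketched as below, using the Iceberg-to-Delta example above. This is an illustration under stated assumptions, not the implementation in this PR: `plan_sync`, the `"ROLLBACK"` and `"FILE_DIFF"` labels, and the `synced` mapping from source identifier to target commit are all hypothetical.

```python
from typing import Mapping, Optional, Tuple


def plan_sync(
    rollback_snapshot: str, synced: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Choose how to apply a detected source rollback to the target.

    synced is a hypothetical mapping of source identifier -> target commit,
    holding only the source commits that were actually synced.
    """
    target_commit = synced.get(rollback_snapshot)
    if target_commit is not None:
        # The target has a matching commit: roll back to it directly.
        return ("ROLLBACK", target_commit)
    # No matching commit: fall back to syncing the rollback as file changes.
    return ("FILE_DIFF", None)


# The example above: only Snapshots 0 and 5 were synced.
synced = {"snapshot-0": "version-0", "snapshot-5": "version-1"}
print(plan_sync("snapshot-0", synced))
print(plan_sync("snapshot-3", synced))
```

A rollback to Snapshot 0 maps to Version 0 and can be issued as a rollback in the target, while a rollback to Snapshot 3 has no counterpart and takes the file-change path.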

Verify this pull request

This pull request is already covered by existing tests; all existing tests should pass.

danielhumanmod commented 1 day ago

Hi @the-other-tim-brown, based on my investigation, both Iceberg and Delta support storing commit-level information, but we might need to adjust our current code. Here’s a summary of the findings:

To align with these capabilities, some code adjustments may be needed for both Iceberg and Delta. I’ll start working on a proof of concept to explore this, and will get back to you once it’s completed.