delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, along with APIs for several languages
https://delta.io
Apache License 2.0

[Feature Request] Enable Clone of Delta Lake tables #1387

Open dennyglee opened 2 years ago

dennyglee commented 2 years ago

Feature request

Enable Clone of Delta Lake tables

Overview

Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not.

Motivation

From business continuity and disaster recovery to streamlining DevOps, cloning of Delta Lake tables enables a wide range of operational scenarios.

Further details

The context for this functionality can be found at https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html
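
Per that documentation, the requested command would look roughly like the following. This is only a sketch mirroring the Databricks syntax; the exact OSS syntax is still to be decided, and the paths and version number here are placeholders.

from pyspark.sql import SparkSession

# Assumes a SparkSession configured with the Delta Lake extensions
spark = SparkSession.builder.getOrCreate()

# Sketch of the requested command, mirroring the Databricks CLONE syntax linked above;
# target/source paths and the version are placeholders.
spark.sql("""
    CREATE TABLE delta.`/path/to/target`
    DEEP CLONE delta.`/path/to/source`
    VERSION AS OF 10
""")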

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

p2bauer commented 2 years ago

Great that this is getting visibility, thank you @dennyglee. I think specifically deep clone functionality would be the most useful for some critical DRP scenarios.

That said, does this feature request encompass the work to port the existing functionality from the core Databricks offering to OSS, or rather a new implementation from scratch?

dennyglee commented 2 years ago

I think so @p2bauer - there is still an open debate on which one makes more sense (port or design from scratch). Any particular thoughts on approach?

oakesk commented 1 year ago

It would be great to have the deep clone as @p2bauer suggests for DRP scenarios; in particular incremental clone/synchronize after the initial clone :+1:

armckinney commented 1 year ago

Hello, I see on the roadmap (https://github.com/delta-io/delta/issues/1307) that shallow clones were added in 2.3 - are there still plans to add deep clones?



edit: removed alternative question.
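
For reference, a minimal sketch of invoking the 2.3 shallow clone mentioned above; the table names are placeholders and the exact syntax should be checked against the 2.3 release notes.

# Assumes an existing SparkSession (`spark`) with the Delta extensions enabled;
# table names are placeholders.
spark.sql("CREATE TABLE IF NOT EXISTS target_table SHALLOW CLONE source_table")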

I believe for the time being we are going to use something like:

%python

# Point-in-time read of the source table, written out as a new Delta table
clone = (spark.read.format("delta")
    .option("timestampAsOf", clone_timestamp.isoformat())
    .load(delta_table_path))

clone.write.format("delta").mode("errorifexists").save(clone_table_path)

sezruby commented 1 year ago

What about

  1. Get the list of files for the latest version
  2. Copy all the files, using same directory structure (e.g. /path/to/table/A=1/a.parquet should be copied to /path/to/backuptable/A=1/a.parquet)
  3. Copy /path/to/table/_delta_log dir to /path/to/backuptable/_delta_log

This is a manual alternative to DEEP CLONE for now.

It's not a complete solution - for example, we don't need to copy the entire _delta_log directory - but implementing even this version would bring a lot more convenience for DRP.
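
A minimal sketch of these steps in PySpark plus plain Python file copies, assuming a local or mounted filesystem; the paths and SparkSession setup are placeholders, not a Delta API.

import os
import shutil

from pyspark.sql import SparkSession

# Assumes a SparkSession configured with the Delta Lake extensions
spark = SparkSession.builder.getOrCreate()

source_path = "/path/to/table"
target_path = "/path/to/backuptable"

# 1. Get the list of data files referenced by the latest version
data_files = spark.read.format("delta").load(source_path).inputFiles()

# 2. Copy each data file, preserving the partition directory structure
#    (e.g. A=1/a.parquet -> /path/to/backuptable/A=1/a.parquet)
for uri in data_files:
    src = uri.replace("file:", "")            # strip the scheme for local paths
    rel = os.path.relpath(src, source_path)   # relative path such as A=1/a.parquet
    dst = os.path.join(target_path, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)

# 3. Copy the _delta_log directory so the target is readable as a Delta table
shutil.copytree(os.path.join(source_path, "_delta_log"),
                os.path.join(target_path, "_delta_log"),
                dirs_exist_ok=True)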

armckinney commented 1 year ago

Interesting take. I think this type of approach will certainly be useful for us in the future. We are currently using a 'DeltaStorageFormat' interface for our ingestion pipelines and have been implementing our own features on top of Delta in this manner. I believe the next one coming up for us will be custom retention policies - i.e. the ability to define which versions to keep after a VACUUM process.

As an aside for Databricks to consider implementing in Delta (our org just doesn't have the manpower to contribute to the project in any meaningful way at the moment) - and perhaps something to drop hints about at DAIS 2023 this week:

I think this is typical for most organizations, as older data generally becomes stale and only needs to be kept for CYA and auditing reasons. Thus, we would be looking to implement a fall-off policy, keeping only versions like: 1 version per year for the past 7 years, 1 version per month for the last year, 1 version per week for the last 3 months, and 1 version per day for the last 30 days.
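
A purely illustrative sketch of how such a fall-off selection could work; the tiers, inputs, and helper names are hypothetical, not an existing Delta or VACUUM option.

from datetime import timedelta

def versions_to_keep(versions, now):
    """Select which versions to retain under the fall-off policy above.

    versions: dict of {version_number: commit_timestamp}.
    Keeps one version per bucket: daily for 30 days, weekly for 3 months,
    monthly for 1 year, yearly for 7 years.
    """
    tiers = [
        (timedelta(days=30),      timedelta(days=1)),    # daily for the last 30 days
        (timedelta(days=90),      timedelta(weeks=1)),   # weekly for the last 3 months
        (timedelta(days=365),     timedelta(days=30)),   # monthly for the last year
        (timedelta(days=7 * 365), timedelta(days=365)),  # yearly for the past 7 years
    ]
    keep = set()
    for horizon, granularity in tiers:
        buckets = {}
        for version, ts in versions.items():
            age = now - ts
            if age <= horizon:
                # keep only the newest version in each granularity bucket
                bucket = age // granularity
                if bucket not in buckets or ts > versions[buckets[bucket]]:
                    buckets[bucket] = version
        keep.update(buckets.values())
    return keep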

IoTier commented 10 months ago

Hi @dennyglee, any idea when Deep Clone is going to be available for OSS Delta tables?