Open dennyglee opened 2 years ago
Great that this is getting visibility, thank you @dennyglee. I think specifically deep clone functionality would be the most useful for some critical DRP scenarios.
That said, is this feature request encompassing the work to port existing functionality from core databricks offering to OSS? Or rather a new implementation from scratch?
@p2bauer - I think there is still an open debate on which one makes more sense (porting the existing functionality or designing it from scratch). Any particular thoughts on the approach?
It would be great to have the deep clone as @p2bauer suggests for DRP scenarios; in particular, incremental clone/synchronize after the initial clone :+1:
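As a rough illustration of what incremental synchronization after an initial clone could look like today, here is a minimal sketch using Change Data Feed plus MERGE. The paths, the id key column, and the externally tracked last_synced_version are assumptions, and the source table would need delta.enableChangeDataFeed set to true.

# Hedged sketch: incrementally sync a previously cloned copy using Change Data Feed.
# Assumes the source table has delta.enableChangeDataFeed=true and that
# last_synced_version is tracked externally (e.g. in a small bookkeeping table).
from delta.tables import DeltaTable

source_path = "/data/source_table"   # assumed path
target_path = "/dr/cloned_table"     # assumed path of the initial clone
last_synced_version = 42             # assumed; last source version already applied

# Read only the changes committed on the source since the last sync.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_synced_version + 1)
    .load(source_path)
    .filter("_change_type != 'update_preimage'"))

# NOTE: if a key can change more than once per sync window, keep only the row
# with the latest _commit_version per key before merging.
(DeltaTable.forPath(spark, target_path).alias("t")
    .merge(changes.alias("s"), "t.id = s.id")   # 'id' is an assumed key column
    .whenMatchedDelete(condition="s._change_type = 'delete'")
    .whenMatchedUpdateAll(condition="s._change_type = 'update_postimage'")
    .whenNotMatchedInsertAll(condition="s._change_type = 'insert'")
    .execute())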
Hello, I see on the roadmap (https://github.com/delta-io/delta/issues/1307) that shallow clones were added in 2.3 - are there still plans to add deep clones?
edit: removed alternative question.
I believe for the time being we are going to utilize something like:

%python
# Manual "deep clone": read the source table as of a point in time and
# write a full copy to the target path (fails if the target already exists).
clone = (spark.read.format("delta")
    .option("timestampAsOf", clone_timestamp.isoformat())
    .load(delta_table_path))
clone.write.format("delta").mode("errorifexists").save(clone_table_path)
The operation will be WRITE instead of CLONE, and the CREATE TABLE ... syntax allows us to not enable Hive.
What about this as a manual alternative to DEEP CLONE for now? It's not a complete solution; for example, we don't need to copy the whole _delta_log directory. However, implementing this version would bring a lot more convenience for DRP.
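As a quick way to see the caveat above, the table history of the copied table can be inspected; a small sketch (clone_table_path is the target path from the snippet above):

# Inspect the commit history of the manually copied table; the recorded
# operation for the copy will show up as a WRITE rather than CLONE.
from delta.tables import DeltaTable

history = DeltaTable.forPath(spark, clone_table_path).history()
history.select("version", "timestamp", "operation").show(truncate=False)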
Interesting take. This type of approach will certainly be useful for us in the future, I think. We are currently utilizing a 'DeltaStorageFormat' interface for our ingestion pipelines and have been implementing our own features on top of Delta in this manner. I believe the next one coming up for us will be custom retention policies - i.e. the ability to define which versions to keep after a VACUUM process.
An aside for Databricks to consider implementing in Delta (our org just doesn't have the manpower to contribute to the project in any meaningful way at the moment), and they might drop hints at DAIS 2023 this week:
I think this is typical for most organizations, as older data generally becomes stale and is only necessary to keep for CYA and auditing reasons. Thus, we would be looking to implement a fall-off policy, keeping only versions like: 1 version every year for the past 7 years, 1 version every month for the last year, 1 version every week for the last 3 months, and 1 version every day for the last 30 days.
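Purely as an illustration of that fall-off schedule (Delta's VACUUM does not support this today), here is a hypothetical helper that, given (version, timestamp) pairs from a table's history, selects which versions such a policy would retain:

# Hypothetical sketch of the fall-off retention schedule described above;
# this only illustrates the selection logic, it is not a Delta API.
from datetime import datetime, timedelta

def versions_to_keep(history, now=None):
    """history: list of (version, timestamp) tuples, newest first."""
    now = now or datetime.utcnow()
    # (bucket granularity, how far back this rule applies), finest rule first
    rules = [
        (timedelta(days=1), timedelta(days=30)),         # daily for last 30 days
        (timedelta(weeks=1), timedelta(days=90)),        # weekly for last 3 months
        (timedelta(days=30), timedelta(days=365)),       # monthly for last year
        (timedelta(days=365), timedelta(days=7 * 365)),  # yearly for past 7 years
    ]
    keep, seen_buckets = set(), set()
    for version, ts in history:
        age = now - ts
        for granularity, horizon in rules:
            if age <= horizon:
                bucket = (granularity, age // granularity)
                if bucket not in seen_buckets:   # newest version wins each bucket
                    seen_buckets.add(bucket)
                    keep.add(version)
                break  # only the finest applicable rule is considered
    return keep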
Hi @dennyglee, any idea when Deep Clone is going to be available for OSS Delta tables?
Feature request
Enable Clone of Delta Lake tables
Overview
Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not.
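For reference, a minimal sketch of what the command looks like in the Databricks syntax linked under "Further details" (table names are placeholders; per the thread above, OSS Delta supports SHALLOW CLONE from 2.3, while DEEP CLONE is what this request asks for):

# Placeholder table names; syntax follows the Databricks CLONE documentation.
# Shallow clone: copies only metadata and references the source's data files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.events_shallow
    SHALLOW CLONE prod.events VERSION AS OF 123
""")

# Deep clone (the subject of this request): also copies the data files themselves.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dr.events_deep
    DEEP CLONE prod.events TIMESTAMP AS OF '2023-06-01'
""")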
Motivation
From business continuity and disaster recovery to streamlining DevOps, cloning of Delta Lake tables enables a wide range of scenarios.
Further details
The context for this functionality can be found at https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?