An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Currently merge in python delta API does two things:
It executes business logic based on merge conditions
It writes changes to s3/fs etc — data access
And it makes impossible to write good (fast & cheap) tests.
Currently an average piece of code looks like this:
def doSomething(DeltaTable deltaTable, DataFrame newDedupedLogs) -> void:
deltaTable.alias("logs") \
.merge( \
newDedupedLogs.alias("newDedupedLogs"), \
"logs.uniqueId = newDedupedLogs.uniqueId" \
) \
.whenNotMatchedInsertAll() \
.execute() # note: it does save data right here
To test it we have only one option for now — it is to write integration tests which write data to s3/fs.
It is very slow and expensive.
It would be nice to split these responsibilities and make it possible to merge without saving data.
The code could look like this basically:
And this code could be tested without handling 'side effects' like data access.
Motivation
It is important cuz makes testing cheaper and faster.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
[ ] Yes. I can contribute this feature independently.
[ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
[x] No. I cannot contribute this feature at this time.
Feature request
Which Delta project/connector is this regarding?
Overview
Hi!
Currently
merge
in python delta API does two things:And it makes impossible to write good (fast & cheap) tests.
Currently an average piece of code looks like this:
To test it we have only one option for now — it is to write integration tests which write data to s3/fs. It is very slow and expensive.
It would be nice to split these responsibilities and make it possible to merge without saving data. The code could look like this basically:
And this code could be tested without handling 'side effects' like data access.
Motivation
It is important cuz makes testing cheaper and faster.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?