delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Split merge and data access logic #3880

Open aurokk opened 1 week ago

aurokk commented 1 week ago

Feature request

Which Delta project/connector is this regarding?

Overview

Hi!

Currently, merge in the Python Delta API does two things:

  1. It executes business logic based on the merge conditions
  2. It writes the changes to S3, a file system, etc. (data access)

This makes it impossible to write good (fast and cheap) tests.


A typical piece of code currently looks like this:

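from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def doSomething(deltaTable: DeltaTable, newDedupedLogs: DataFrame) -> None:
    # Parentheses replace the backslash continuations of the original snippet.
    (
        deltaTable.alias("logs")
        .merge(
            newDedupedLogs.alias("newDedupedLogs"),
            "logs.uniqueId = newDedupedLogs.uniqueId",
        )
        .whenNotMatchedInsertAll()
        .execute()  # note: this saves the data right here
    )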

For now, the only way to test it is to write integration tests that write data to S3 or a file system, which is very slow and expensive.


It would be nice to split these responsibilities so that a merge can be performed without saving data. The code could look roughly like this:

def doSomething(target: DataFrame, source: DataFrame) -> DataFrame:
    return (
        DeltaTable.merge(  # note: proposed static helper method
            target.alias("logs"),
            source.alias("newDedupedLogs"),
            "logs.uniqueId = newDedupedLogs.uniqueId",
        )
        .whenNotMatchedInsertAll()
        .execute()  # note: it returns a DataFrame instead of writing
    )

This code could then be tested without handling side effects such as data access.
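To illustrate the kind of split being asked for, here is a minimal sketch in plain Python, with lists of dicts standing in for DataFrames. It is not Delta or Spark code: the names merge_insert_not_matched, do_something and the injected write callable are hypothetical. It only models the insert-only merge from the example above and assumes the source rows are already deduplicated.

```python
from typing import Any, Callable

Row = dict[str, Any]  # stand-in for a DataFrame row (assumption for this sketch)


def merge_insert_not_matched(target: list[Row], source: list[Row], key: str) -> list[Row]:
    """Pure merge logic: target rows plus source rows whose key has no match in target."""
    target_keys = {row[key] for row in target if row[key] is not None}
    # A NULL key never satisfies an SQL equality condition, so such rows count as not matched.
    inserted = [row for row in source if row[key] is None or row[key] not in target_keys]
    return target + inserted


def do_something(target: list[Row], source: list[Row], write: Callable[[list[Row]], None]) -> None:
    """Thin shell: run the pure logic, then hand the result to an injected writer."""
    write(merge_insert_not_matched(target, source, "uniqueId"))
```

With this shape, the merge conditions can be unit tested by calling merge_insert_not_matched directly, and the data access is reduced to whatever write callable is passed in, which a test can replace with an in-memory collector.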

Motivation

This is important because it makes testing cheaper and faster.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?