Open sor-droneup opened 11 months ago
Hi, we are also facing the same issue. Is there any workaround or patch that could be made available?
@johanl-db Could you please take a look at this issue?
The following may work. Given:
MERGE INTO target
USING source
ON merge_condition
WHEN NOT MATCHED BY SOURCE AND <not_matched_by_source_condition_1> THEN ...
WHEN NOT MATCHED BY SOURCE AND <not_matched_by_source_condition_2> THEN ...
If merge_condition and all not_matched_by_source_condition_N share the same target-only predicate (typically a partition filter), then we can apply that predicate when filtering target files in findTouchedFiles.
One way of doing this would be to extract the target-only predicates from the merge condition and from each NOT MATCHED BY SOURCE condition using splitConjunctivePredicates, and OR them together to only filter files that match all of them.
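As a rough illustration of this idea, here is a minimal sketch in plain Python, not actual Spark or Delta code: conditions are modeled as strings, splitConjunctivePredicates is approximated by splitting on AND, and the target-only check is approximated by the absence of a hypothetical "source." column prefix. All names and inputs here are illustrative assumptions.

```python
def split_conjunctive_predicates(cond: str) -> list[str]:
    """Rough stand-in for Spark's splitConjunctivePredicates: split on AND."""
    return [p.strip() for p in cond.split(" AND ")]

def target_only(predicates: list[str]) -> list[str]:
    """Keep predicates that reference only target columns (assumed here to
    mean: no reference to a 'source.' column)."""
    return [p for p in predicates if "source." not in p]

def file_filter(merge_condition: str, nmbs_conditions: list[str]) -> str:
    """Build a single filter that could prune target files in findTouchedFiles.

    The per-clause target-only predicates are ORed together so the combined
    filter keeps every file that any clause might touch (safe for all clauses).
    """
    per_clause = []
    for cond in [merge_condition] + nmbs_conditions:
        preds = target_only(split_conjunctive_predicates(cond))
        if not preds:
            # A clause with no target-only predicate can touch any file,
            # so no pruning is possible.
            return "true"
        per_clause.append("(" + " AND ".join(preds) + ")")
    return " OR ".join(per_clause)

print(file_filter(
    "target.id = source.id AND target.class = 1",
    ["target.class = 1 AND target.flag = true"],
))
# → (target.class = 1) OR (target.class = 1 AND target.flag = true)
```

A real implementation would of course operate on Catalyst Expression trees rather than strings, but the shape of the logic, split each condition into conjuncts, keep the target-only ones, and union the per-clause filters, would be the same.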
The implementation should be short but will require careful testing to ensure:
Unfortunately I have limited bandwidth to handle this at the moment, but I'm happy to provide support and review for external contributions.
Feature request
Which Delta project/connector is this regarding?
Overview
Introduce support for honoring partition disjointness in the conditions provided to the "whenNotMatchedBySource" group of operations.
In merge operations belonging to the "whenNotMatchedBySource" family, any conditions that would normally guarantee partition disjointness are currently ignored, resulting in a concurrency error being raised.
For the other clause types, it is possible to include a partition-disjoint predicate within the join condition (e.g., "class=1"), which enables concurrent updates across multiple partitions.
All operations within the "whenMatched" and "whenNotMatched" groups are designed to be concurrency-safe in this way. However, as soon as you incorporate "whenNotMatchedBySourceDelete", Spark throws a ConcurrentAppendException.
Motivation
Adding partition-disjoint concurrency support to whenNotMatchedBySource will increase the utility of Delta Lake tables.
Further details
An example scenario: 1) my source is a non-Delta file (e.g. JSON/Parquet); 2) this file sources one of the partitions within my Delta table; 3) I want to merge into the specified partition in my table. Within this partition I want to run a query that upserts the data and removes rows that no longer exist in the source (basically a refresh). To do that I currently have 2 options:
The code snippet that simulates this behaviour:
In the example above, although a condition on category was specified that would make this operation partition-scoped, it is ignored, since whenNotMatchedBySource is considered non-partition-scoped.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?