MrPowers / mack

Delta Lake helper methods in PySpark
https://mrpowers.github.io/mack/
MIT License
286 stars 42 forks

Fix for Issue #2 #113

Closed: holden-herrell closed 9 months ago

holden-herrell commented 1 year ago

These changes were made with help from @Amrit-Hub's comment about the upsert logic. The existing logic updates a row when the primary key matches and the match columns have changed, and inserts otherwise. That assumes we never pass an existing PK whose match-column values are unchanged. When we do, the row fails the match condition and falls through to the insert branch, so duplicate records can be inserted on each run of the upsert. By using only the PK as the match condition, we can push the changed-columns check further down into the update step and avoid unintentionally re-inserting records whose PK/match-column combination is unchanged since the last upsert attempt.
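To make the difference concrete, here is a minimal plain-Python model of the two behaviours described above. It is only a sketch of the merge semantics, not the code in this PR or in mack: the table is a list of dicts, and the names `upsert_old`, `upsert_new`, `pk`, and `cols` are hypothetical and chosen for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def upsert_old(target: List[Row], source: List[Row], pk: str, cols: List[str]) -> List[Row]:
    """Old behaviour (illustrative): match on PK AND changed columns, else insert."""
    existing = [dict(r) for r in target]
    inserts: List[Row] = []
    for s in source:
        matched = False
        for t in existing:
            # The "columns changed" check is part of the match condition itself.
            if t[pk] == s[pk] and any(t[c] != s[c] for c in cols):
                t.update(s)
                matched = True
        if not matched:
            # An unchanged row with an existing PK lands here and is duplicated.
            inserts.append(dict(s))
    return existing + inserts


def upsert_new(target: List[Row], source: List[Row], pk: str, cols: List[str]) -> List[Row]:
    """New behaviour (illustrative): match on PK only, check changes when updating."""
    existing = [dict(r) for r in target]
    inserts: List[Row] = []
    for s in source:
        matches = [t for t in existing if t[pk] == s[pk]]
        if not matches:
            # Only rows with a genuinely new PK are inserted.
            inserts.append(dict(s))
            continue
        for t in matches:
            # The "columns changed" check now only guards the update.
            if any(t[c] != s[c] for c in cols):
                t.update(s)
    return existing + inserts


# Hypothetical data: id 1 already exists and is unchanged, id 2 is new.
target = [{"id": 1, "name": "a"}]
source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

old_result = upsert_old(target, source, "id", ["name"])
new_result = upsert_new(target, source, "id", ["name"])
```

With this data, `old_result` contains the row for id 1 twice, while `new_result` contains it once. In Delta Lake's Python merge builder, the same idea corresponds to keeping only the PK equality in the `merge` condition and passing the changed-columns predicate as the `condition` argument of `whenMatchedUpdate`.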

MrPowers commented 9 months ago

Sorry for not reviewing this earlier. Looks like this was fixed in this commit. Sorry again for being so slow on this one.