Nike-Inc / spark-expectations

A Python Library to support running data quality rules while the spark job is running⚡
https://engineering.nike.com/spark-expectations
Apache License 2.0
148 stars 32 forks

[Feature] updating code to work for retries if the save table fails #61

Closed asingamaneni closed 7 months ago

asingamaneni commented 7 months ago

Description

Adds retry support for writing into the stats table when the write fails, and updates table properties only if the property is not already set.
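The retry behaviour described above could be sketched as a small generic helper (a hypothetical illustration, not the PR's actual implementation — `retry_write` and its parameters are names invented here). It wraps any zero-argument write callable, such as a closure around `df.write.saveAsTable(...)`, and retries with a simple backoff:

```python
import time


def retry_write(write_fn, max_attempts=3, backoff_seconds=1.0):
    """Call write_fn(), retrying on failure with linear backoff.

    write_fn is any zero-argument callable, e.g. a closure around
    df.write.saveAsTable(...). Returns write_fn's result on success;
    re-raises the last exception once all attempts are exhausted.
    """
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception as exc:  # in practice, narrow this to the write errors you expect
            last_exc = exc
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    raise last_exc
```

Keeping the backoff linear (or exponential) gives concurrent writers a chance to finish before the next attempt.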

Related Issue

#60

Motivation and Context

When multiple processes write into the stats table at the same time, the concurrent writes can conflict. This change adds retries and writes table properties only when they have not already been set.
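The "write properties only when not already set" idea can be illustrated with a small pure helper (hypothetical names; in practice the existing properties would come from something like `SHOW TBLPROPERTIES`, and the missing ones would be applied with `ALTER TABLE ... SET TBLPROPERTIES`):

```python
def properties_to_set(existing: dict, desired: dict) -> dict:
    """Return only the desired properties that are not already present
    on the table, so repeated runs do not re-issue ALTER TABLE statements
    (which themselves commit to the table and can conflict with writers)."""
    return {k: v for k, v in desired.items() if k not in existing}
```

Skipping already-set properties avoids unnecessary metadata commits, which is useful precisely because those commits can collide with concurrent inserts.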

How Has This Been Tested?

The code has been unit tested

Screenshots (if appropriate):


Types of changes

Checklist:

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison: base (5e6c601) 100.00% vs head (f4e5876) 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##               main       #61   +/-  ##
=========================================
  Coverage    100.00%   100.00%
=========================================
  Files            22        22
  Lines          1441      1447    +6
=========================================
+ Hits           1441      1447    +6
```

:umbrella: View full report in Codecov by Sentry.

Umeshsp22 commented 7 months ago

LGTM

phanikumarvemuri commented 7 months ago

I see we are retrying the write irrespective of the exception cause. Should we not retry only when it failed due to concurrent writes? Also, per the Delta Lake docs, multiple inserts should not conflict. Can you please share the exception details (the stack trace)?
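The distinction the reviewer raises — retry only on concurrency conflicts, not on every exception — could be sketched like this (a hypothetical illustration; Delta Lake raises classes such as `ConcurrentAppendException`, and matching on the class name keeps the sketch free of a Delta dependency):

```python
def retry_on(predicate, fn, max_attempts=3):
    """Retry fn() only when predicate(exc) says the failure is retryable;
    any other exception is re-raised immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not predicate(exc):
                raise


def is_concurrent_write_conflict(exc) -> bool:
    # Assumed heuristic: Delta's conflict exceptions (ConcurrentAppendException,
    # ConcurrentWriteException, ...) all carry "Concurrent" in the class name.
    return "Concurrent" in type(exc).__name__
```

With this shape, a schema mismatch or permission error surfaces on the first attempt instead of being retried pointlessly.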