Netflix / genie

Distributed Big Data Orchestration Service
https://netflix.github.io/genie
Apache License 2.0
1.7k stars 365 forks source link

Files deletion cleanup improvements #1184

Closed xiao-chen closed 1 year ago

xiao-chen commented 1 year ago

Break down the file deletion clean up task to perform clean up and look up in smaller trunks.

codecov[bot] commented 1 year ago

Codecov Report

Merging #1184 (46a3855) into master (4fa5f5c) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #1184      +/-   ##
============================================
+ Coverage     90.73%   90.74%   +0.01%     
- Complexity     3796     3798       +2     
============================================
  Files           467      467              
  Lines         14933    14951      +18     
  Branches       1068     1069       +1     
============================================
+ Hits          13549    13567      +18     
  Misses          908      908              
  Partials        476      476              
Impacted Files Coverage Δ
...a/services/impl/jpa/JpaPersistenceServiceImpl.java 96.48% <100.00%> (ø)
...enie/web/properties/DatabaseCleanupProperties.java 100.00% <100.00%> (ø)
...ix/genie/web/tasks/leader/DatabaseCleanupTask.java 89.02% <100.00%> (+0.68%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

enicloom commented 1 year ago

The implementation LGTM. Performance tests are needed before merge. This change does break the consistency guarantee during file cleanup since the job submitted during the cleanup process can experience inconsistent states between submitted and executed.

enicloom commented 1 year ago

This change does break the consistency guarantee during file cleanup since the job submitted during the cleanup process can experience inconsistent states between submitted and executed.

This is incorrect. It won't break the consistency guarantee for jobs submitted then executed between cleanups.

xiao-chen commented 1 year ago

Thank you @enicloom , yes this will be consistent, but delete (and lookup for deletion) in smaller batches. The distinct queries are still looking for all, which guarantees consistency. Those tables are small enough to not cause problems, but if they do, then a different approach would be needed.

Did performance testing offline on mysql, performance insights looking good.