🗒️ Summary
Implements a mechanism to avoid redundant processing work (in repairkit, though the approach is easily reused by other sweepers). Previously, merely iterating through the results of repairkit's initial query as a no-op took an inordinate amount of time. Now, the initial query returns only documents which have not yet been updated by a repairkit version greater than or equal to the current version.
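The selection rule can be sketched as a query body. This is a minimal illustration, not the code in this PR: it assumes the registry accepts an Elasticsearch/OpenSearch-style query DSL, and the function names and the version value `2` are hypothetical.

```python
from typing import Any, Dict, Optional


def build_unprocessed_query(version_field: str, current_version: int) -> Dict[str, Any]:
    """Match documents whose stored sweeper version is missing or below current_version."""
    return {
        "query": {
            "bool": {
                # must_not + range(gte) excludes up-to-date documents. Documents
                # lacking the field never match the range clause, so they are kept.
                "must_not": [
                    {"range": {version_field: {"gte": current_version}}}
                ]
            }
        }
    }


def needs_processing(stored_version: Optional[int], current_version: int) -> bool:
    """Plain-Python mirror of the query's selection rule, for illustration only."""
    return stored_version is None or stored_version < current_version


# Hypothetical current version of 2 for the repairkit sweeper.
query = build_unprocessed_query(
    "ops:Provenance/ops:registry_sweepers_repairkit_version", 2
)
```

With this shape, a second run at the same version selects nothing, which is consistent with the near-zero runtime reported in the test notes.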
For any sweeper, the sweeper version should be written as an integer to `f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version"`. This version should be updated in the sweeper's `constants` submodule whenever a change is made to the sweeper which invalidates previous processing of documents.
Because a version bump can only result from a code change in the first place, I've elected to hard-code the version in `constants` instead of using a configuration file, for simplicity.
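As a rough sketch of the convention described above (not the actual module contents): the constant names, the helper functions, and the version value below are hypothetical. Only the metadata key format comes from this PR.

```python
# Hypothetical contents of a sweeper's constants submodule.
# Bump SWEEPER_VERSION whenever a change invalidates previously processed documents.
SWEEPER_NAME = "repairkit"
SWEEPER_VERSION = 2  # illustrative value only


def version_metadata_key(sweeper_name: str) -> str:
    """Build the metadata key under which a sweeper records its version."""
    return f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version"


def version_stamp(sweeper_name: str, version: int) -> dict:
    """Return a partial-document update recording which version processed a document."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        raise TypeError("sweeper version must be an integer")
    return {version_metadata_key(sweeper_name): version}


stamp = version_stamp(SWEEPER_NAME, SWEEPER_VERSION)
```

Merging `stamp` into each document's update is what lets a later run at the same version skip that document.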
Timeouts have been bumped to improve stability when run against prod registries. I've hardcoded them initially, but if they need to be changed a second time, I'll move them to a CLI argument.
There are additional unrelated changes/bugfixes included in this PR:
- demote a noisy log
- minor refactoring/de-crufting
- fix an erroneous error log
⚙️ Test Data and/or Report
Manually tested against local, then en-prod (repairkit took 53min initially, and 0sec thereafter)
♻️ Related Issues
related to #61
fixes #70