NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Implement sweeper data versioning for repairkit #73

Closed alexdunnjpl closed 9 months ago

alexdunnjpl commented 9 months ago

🗒️ Summary

Implements a mechanism (in repairkit, though the approach is easily reused elsewhere) to avoid redundant processing work. Previously, merely iterating through the results of repairkit's initial query, without changing anything, took an inordinate amount of time. Now, the initial query returns only documents that have not already been processed by a repairkit version greater than or equal to the current version.
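To illustrate the filtering described above, here is a minimal sketch of such a query, assuming the registry is queried with an Elasticsearch/OpenSearch-style query DSL. The names `build_unprocessed_query` and `needs_processing` are hypothetical and are not taken from the repairkit code; the actual query in this PR may be structured differently.

```python
from typing import Any, Dict

# Field name follows the convention described in this PR, for the repairkit sweeper.
VERSION_FIELD = "ops:Provenance/ops:registry_sweepers_repairkit_version"


def build_unprocessed_query(current_version: int) -> Dict[str, Any]:
    """Build a query body (hypothetical helper) selecting documents that still need processing.

    A range clause never matches documents that lack the field, so negating it
    selects both never-processed documents and those stamped with an older version.
    """
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"range": {VERSION_FIELD: {"gte": current_version}}}
                ]
            }
        }
    }


def needs_processing(doc: Dict[str, Any], current_version: int) -> bool:
    """Pure-Python equivalent of the filter above, for illustration only."""
    stored = doc.get(VERSION_FIELD)
    return stored is None or int(stored) < current_version
```

Under this scheme, a second run at the same version matches no documents, which is consistent with the near-zero runtime reported for repeat runs.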

For any sweeper, the sweeper version should be written as an integer to f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version". This version should be updated in the sweeper's constants submodule whenever a change is made to the sweeper that invalidates previous processing of documents.
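The convention above could be sketched roughly as follows. The constant value and both helper names (`sweeper_version_field`, `with_version_stamp`) are assumptions made for illustration, not the identifiers used in the repository.

```python
from typing import Any, Dict

# Hypothetical value; in practice this would live in the sweeper's constants
# submodule and change whenever earlier processing is invalidated.
REPAIRKIT_VERSION = 2


def sweeper_version_field(sweeper_name: str) -> str:
    """Return the provenance field name for the given sweeper."""
    return f"ops:Provenance/ops:registry_sweepers_{sweeper_name}_version"


def with_version_stamp(update: Dict[str, Any], sweeper_name: str, version: int) -> Dict[str, Any]:
    """Return a copy of a partial document update that also records the sweeper version."""
    stamped = dict(update)  # copy so the caller's dict is not mutated
    stamped[sweeper_version_field(sweeper_name)] = int(version)
    return stamped
```

Writing the version stamp in the same update as the repair itself means a document is marked as processed only when its fixes are actually applied.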

Because a version change can only result from a code change in the first place, I've elected to hard-code the version in constants rather than use a configuration file, for simplicity.

Timeouts have been increased to improve stability when running against prod registries. I've hard-coded them for now, but if they need to be changed a second time, I'll move them to a CLI argument.

This PR also includes additional unrelated changes and bugfixes.

⚙️ Test Data and/or Report

Manually tested against local, then en-prod (repairkit took 53 min on the initial run and 0 sec thereafter).

♻️ Related Issues

Related to #61; fixes #70