NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0
0 stars 1 forks source link

improve repairkit sweeper performance #62

Closed alexdunnjpl closed 10 months ago

alexdunnjpl commented 10 months ago

🗒️ Summary

Improves performance of repairkit by utilizing _bulk write endpoint properly, allowing it to be feasibly run on large nodes.

Implements lazy/generator-compatible handling of updates, which should either improve memory performance outright or facilitate subsequent memory optimisation efforts.

Adds query ids to log messages, allowing disambiguation of progress updates during nested/long-running queries.

Tweaks sweepers-driver, as I've confused the execution order with the factory declaration order like five times.

⚙️ Test Data and/or Report

provenance/ancestry tested heuristically - they take slightly-less time compared to pre-change sweepers (minor speedup combined with retention of no-op updates, which are faster than if data were being changed), and small-volume spot tests of documents show presence of correct metadata

ancestry tested with functional tests - passing

repairkit currently untested - need to test against a db with unrepaired data or an ad-hoc local db with unrepaired data to be absolutely certain, though the potential for error is very low and will be easily detectable by comparing initial vs subsequent runs (whether or not writes stop happening after data is fixed)

♻️ Related Issues

required for completion of #61

@jordanpadams @tloubrieu-jpl @sjoshi-jpl I strongly suggest consideration of a metadata flag containing the most-recent version of repairkit or registry-sweepers which has touched a document. Just iterating over pds-en (resulting in no updates) takes half an hour every time sweepers executes, which could be avoided entirely by filtering the resultset db-side to just those documents which haven't been hit with a current version of repairkit. I know we've talked about that previously but I don't know if we have a ticket for it yet.

@nutjob4life any smart ideas for how to resolve the version? Candidates off the top of my head are:

alexdunnjpl commented 10 months ago

@tloubrieu-jpl @nutjob4life any recent changes to docs generation I should know about? (re the failing branch test)

nutjob4life commented 10 months ago

Hi @alexdunnjpl given that adding a "column" to a document is relatively inexpensive I think it'd be certainly fine to have a repairkit/sweeper version tag that tells what release of the software touched it most recently. I'd make it a plain integer though, mostly since open search doesn't have a "version" field type, and indexing and comparisons would be cheap.

I hate to hearken back to Plone but those folks did a few things right—and one was separating the notion of software version from data version (what they called a "profile" version). You might change software in ways that doesn't affect stored data at all (fixing a typo in a help message, for example). But then you might change both. They kept the profile version in a separate file (but still part of the Python package).

So we've got VERSION.txt for software version like 1.1.0. We could add a PROFILE.txt with something like 3 in it. It's a one-time read at start up, right?

alexdunnjpl commented 10 months ago

So we've got VERSION.txt for software version like 1.1.0. We could add a PROFILE.txt with something like 3 in it. It's a one-time read at start up, right?

@nutjob4life this sounds good - kinda reminds me of incremental db migrations.

We still have the issue of "dev forgets to increment the value" but I don't see us getting away from that and the performance benefits are pretty persuasive. We could also throw a nice big obvious log message (with the sweepers profile version in use) at execution start to aid in detecting that situation.

Resolution is a once-per-execution thing, yes, so anything up to a full minute is viable.

alexdunnjpl commented 10 months ago

Good to merge as soon as @sjoshi-jpl and @alexdunnjpl have tested in-situ

alexdunnjpl commented 10 months ago

@tloubrieu-jpl did you intend to merge this before we'd tested it in-situ?