NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Non redundant ancestry #100

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 5 months ago

🗒️ Summary

Implements #91

During a processing run, db writes are skipped for bundle/collection documents that have already been processed by an up-to-date version of the ancestry software, as indicated by a version tag like the one repairkit uses.

All processing, read/compute/write, is skipped for non-aggregate product references belonging to a registry-refs page which has already been processed with an up-to-date version of the software.
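
A minimal sketch of the version-tag check described in the two paragraphs above. The metadata key `ancestry_version`, the constant `SWEEPER_VERSION`, and both function names are hypothetical placeholders for illustration, not identifiers taken from registry-sweepers.

```python
from typing import Any, List, Mapping

# Hypothetical values: the real key name and version number live in the sweeper code.
SWEEPER_VERSION = 3
VERSION_KEY = "ancestry_version"


def is_up_to_date(doc: Mapping[str, Any], current_version: int = SWEEPER_VERSION) -> bool:
    """Return True if the document carries a version tag at or above the current version."""
    tagged = doc.get(VERSION_KEY)
    return isinstance(tagged, int) and tagged >= current_version


def select_writes(docs: List[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep only the documents that still need a db write."""
    return [d for d in docs if not is_up_to_date(d)]
```

A document with no tag, or with an older tag, is treated as stale and is written again.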

Db writes are ordered so that, if an aggregate product is tagged as up-to-date, it can be inferred that all of its descendants are also up-to-date. This assumes that re-harvesting a bundle or collection overwrites its document, losing existing ancestry version metadata (@jordanpadams @tloubrieu-jpl @al-niessner is this a safe assumption, or do I need to check harvest's code?).
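
To illustrate why the write order matters, here is a rough sketch that uses the stage order listed later in this description (nonaggs, refs_pages, collections, bundles). The function names and the simulated failure are assumptions for illustration only. If a run dies partway through, no aggregate has been tagged before its descendants.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Descendants first, aggregates last.
WRITE_ORDER = ("nonaggs", "refs_pages", "collections", "bundles")


def ordered_updates(updates_by_kind: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (kind, doc_id) pairs, descendants before the aggregates that contain them."""
    for kind in WRITE_ORDER:
        for doc_id in updates_by_kind.get(kind, []):
            yield kind, doc_id


def apply_until_failure(updates: Iterable[Tuple[str, str]], fail_after: int) -> List[Tuple[str, str]]:
    """Simulate a run that dies after `fail_after` writes and return what was written."""
    written = []
    for i, update in enumerate(updates):
        if i >= fail_after:
            break
        written.append(update)
    return written
```

With this ordering, an interrupted run leaves bundles and collections untagged, so the next run reprocesses them rather than wrongly skipping their descendants.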

Once processing has completed, any products or registry-refs pages that are not marked as up-to-date are counted, and the count is output in an ERROR log. A non-zero count indicates either that some products were harvested during sweeper processing, or that some are being missed and require a (much slower, yet-to-be-implemented) validation sweeper to process correctly.
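
The post-run check above could be sketched as follows. This version counts over an in-memory list, whereas the real sweeper would presumably count via a db query. The logger name, the key `ancestry_version`, and the function names are hypothetical.

```python
import logging
from typing import Any, Iterable, Mapping

log = logging.getLogger("ancestry-sketch")  # hypothetical logger name
VERSION_KEY = "ancestry_version"  # hypothetical metadata key


def _tagged_version(doc: Mapping[str, Any]) -> int:
    """Return the document's version tag, or -1 if it has no usable tag."""
    value = doc.get(VERSION_KEY)
    return value if isinstance(value, int) else -1


def count_stale(docs: Iterable[Mapping[str, Any]], current_version: int) -> int:
    """Count documents not tagged as up-to-date and log an ERROR if any remain."""
    stale = sum(1 for d in docs if _tagged_version(d) < current_version)
    if stale:
        log.error(
            "%d documents are not tagged with ancestry version %d after processing",
            stale,
            current_version,
        )
    return stale
```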

Execution time for ancestry against sbnpsi is ~35min. With the optimisations it's <2sec on subsequent runs. For nodes like psa, which have a million aggregate products, a runtime that short should not be expected, but the optimisations should still cut ancestry runtime to 0.1-1% of its previous duration.

The only caveat here is that progress is only made if execution completes: if ancestry repeatedly fails mid-execution due to resource issues, it won't make incremental progress and eventually succeed. This means that it's probably best for me to perform the first run against each node on a local machine with plenty of disk space, to avoid the need to allocate unnecessarily large storage for ECS. This process will need to be repeated if/when the ancestry software version is incremented.

@jordanpadams @tloubrieu-jpl Further optimization to make incremental progress, avoiding this caveat, is possible and may be desirable. It just requires a less-naive approach to ordering/streaming the updates, for example writing them in the order:

c1p1_nonaggs
c1p2_nonaggs
c1_refs_pages
c2p1_nonaggs
c2p2_nonaggs
c2_refs_pages
...

instead of

nonaggs
refs_pages
collections
bundles

Please open/triage a ticket for this work if that seems warranted.
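
The interleaved ordering listed above could be generated along these lines. This sketch assumes that `c1p1` means "collection 1, page 1" and that each collection maps to a list of its registry-refs pages. The function name and input shape are hypothetical.

```python
from typing import Dict, Iterator, List


def interleaved_batches(collections: Dict[str, List[str]]) -> Iterator[str]:
    """For each collection, yield its per-page non-aggregate batches, then its refs pages."""
    for collection, pages in collections.items():
        for page in pages:
            yield f"{collection}{page}_nonaggs"
        # The collection's refs pages are written only after all of its non-aggregates.
        yield f"{collection}_refs_pages"


batches = list(interleaved_batches({"c1": ["p1", "p2"], "c2": ["p1", "p2"]}))
```

Because each collection is finished before the next one starts, a run that fails midway would still leave the earlier collections fully tagged.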

⚙️ Test Data and/or Report

Functional tests pass. New changes tested manually, final manual test in progress.

♻️ Related Issues

fixes #91

alexdunnjpl commented 5 months ago

sbnpsi benchmark, 1.5M products: execution dropped from 35min to 57sec. The remaining runtime is at least partly because orphaned documents are continually reprocessed, and sbnpsi has ~5k remaining after processing due to missing collections.

It's unclear why it's taking 1/35th of the time despite only reprocessing 1/300th of the document corpus.
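
For reference, the two fractions quoted above work out as follows, using the approximate figures from this comment (35min, 57sec, ~5k of 1.5M documents). The variable names are illustrative only.

```python
before_s = 35 * 60          # previous runtime in seconds
after_s = 57                # runtime with the optimisations
remaining_docs = 5_000      # approximate orphaned documents still reprocessed
total_docs = 1_500_000      # approximate size of the sbnpsi corpus

time_fraction = after_s / before_s          # about 1/37 of the previous runtime
doc_fraction = remaining_docs / total_docs  # 1/300 of the corpus
excess = time_fraction / doc_fraction       # about 8x more time than proportional scaling
```

In other words, the rerun takes roughly eight times longer than it would if runtime scaled linearly with the number of reprocessed documents.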