Non redundant ancestry

🗒️ Summary

Implements #91

During a processing run, db writes are skipped for bundle/collection documents that have already been processed by an up-to-date version of the ancestry software (as indicated by a version tag, like the one repairkit uses).

All processing (read/compute/write) is skipped for non-aggregate product references belonging to a registry-refs page that has already been processed by an up-to-date version of the software.
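To illustrate the skip check described above, here is a minimal sketch. The metadata key `ancestry_version` and the version number are hypothetical placeholders, not the names used in the sweeper's code.

```python
from typing import Any, Iterable, List, Mapping

# Assumed names for illustration only: the real key and version live in the sweeper.
ANCESTRY_VERSION_KEY = "ancestry_version"
CURRENT_ANCESTRY_VERSION = 3


def is_up_to_date(doc: Mapping[str, Any], current_version: int = CURRENT_ANCESTRY_VERSION) -> bool:
    """Return True if the document carries a version tag at least as new as current_version."""
    tagged = doc.get(ANCESTRY_VERSION_KEY)
    # A missing or non-integer tag is treated as "never processed".
    return isinstance(tagged, int) and tagged >= current_version


def docs_needing_processing(docs: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep only the documents that still need to be processed and written."""
    return [doc for doc in docs if not is_up_to_date(doc)]
```

A document with no tag, or with a tag older than the current version, is kept for processing; everything else is skipped.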

Db writes are ordered so that if an aggregate product has been tagged as up-to-date, it can be inferred that all of its descendants are also up-to-date. This assumes that re-harvesting a bundle or collection overwrites its document, losing existing ancestry version metadata (@jordanpadams @tloubrieu-jpl @al-niessner is this a safe assumption, or do I need to check harvest's code?).
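The ordering idea can be sketched as follows: descendants are written before the aggregates that contain them, so a tag on an aggregate is only written once everything beneath it has been written. The `Stage` enum and the `(stage, doc_id)` tuples are hypothetical, chosen only to illustrate the ordering.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple


class Stage(IntEnum):
    """Write order: lower values are written first (descendants before aggregates)."""
    NONAGG = 0
    REFS_PAGE = 1
    COLLECTION = 2
    BUNDLE = 3


Update = Tuple[Stage, str]  # (stage, document id); a stand-in for a real update object


def order_writes(updates: Iterable[Update]) -> List[Update]:
    """Sort updates by stage; sorted() is stable, so order within a stage is preserved."""
    return sorted(updates, key=lambda update: update[0])
```

If a run dies partway through, the aggregates written last remain untagged, so the next run reprocesses them and their descendants.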

Once processing has completed, any products or registry-refs pages which do not indicate that they are up-to-date are counted and reported in an ERROR log. A non-zero count indicates either that some products were harvested during sweeper processing, or that some are being missed and require a (much slower, yet-to-be-implemented) validation sweeper to process correctly.
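A rough sketch of the post-run check, assuming documents are available as plain dicts and that the version tag lives under a hypothetical `ancestry_version` key (the real query and field name are not shown here):

```python
import logging
from typing import Any, Iterable, Mapping

log = logging.getLogger("ancestry")  # logger name is an assumption


def count_stale(
    docs: Iterable[Mapping[str, Any]],
    current_version: int,
    version_key: str = "ancestry_version",  # hypothetical field name
) -> int:
    """Count documents not tagged as up-to-date and log an ERROR if any exist."""
    stale = 0
    for doc in docs:
        tagged = doc.get(version_key)
        if not isinstance(tagged, int) or tagged < current_version:
            stale += 1
    if stale > 0:
        log.error(
            "%d products or registry-refs pages are not tagged as up-to-date; "
            "they were either harvested during processing or missed by the sweeper",
            stale,
        )
    return stale
```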

Execution time for ancestry against sbnpsi is ~35min. With the optimisations, subsequent runs take <2sec. Nodes like psa, which have a million aggregate products, should not be expected to run that quickly, but ancestry runtime should still drop to 0.1-1% of its previous duration.

The only caveat here is that progress is only made if execution completes: if ancestry repeatedly fails mid-execution due to resource issues, it will not make incremental progress toward eventually succeeding. This means that it's probably best for me to perform the first run against each node on a local machine with plenty of disk space, to avoid the need to allocate unnecessarily large storage for ECS. This process will need to be repeated if/when the ancestry software version is incremented.

@jordanpadams @tloubrieu-jpl Further optimization to make incremental progress, avoiding this caveat, is possible and may be desirable. It just requires a less-naive approach to ordering/streaming the updates, such as:

c1p1_nonaggs
c1p2_nonaggs
c1_refs_pages
c2p1_nonaggs
c2p2_nonaggs
c2_refs_pages
...

instead of

nonaggs
refs_pages
collections
bundles
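As a sketch of the first ordering above, a generator could emit the write batches collection by collection, so that each collection's refs pages are written as soon as its own non-aggregates are done. The labels and the `pages_per_collection` mapping are hypothetical, and where collection and bundle documents would fit is left out because it is not spelled out above.

```python
from typing import Iterator, Mapping


def incremental_batches(pages_per_collection: Mapping[str, int]) -> Iterator[str]:
    """Yield batch labels per collection: each page's non-aggregates, then the refs pages."""
    for collection, page_count in pages_per_collection.items():
        for page in range(1, page_count + 1):
            yield f"{collection}p{page}_nonaggs"
        # Refs pages for this collection are written only after all of its non-aggregates.
        yield f"{collection}_refs_pages"
```

For example, `list(incremental_batches({"c1": 2, "c2": 2}))` yields the six labels listed in the first ordering.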

Please open/triage a ticket for this work if that seems warranted.

⚙️ Test Data and/or Report

Functional tests pass. New changes have been tested manually; a final manual test is in progress.

♻️ Related Issues

fixes #91

NASA-PDS / registry-sweepers

Non redundant ancestry #100
