Closed: alexdunnjpl closed this 5 months ago
sbnpsi benchmark, 1.5M products: execution dropped from 35min to 57sec. The remaining runtime is at least partly because orphaned documents are continually reprocessed; sbnpsi has ~5k orphans remaining after processing due to missing collections.
It's unclear why execution takes 1/35th of the original time despite reprocessing only 1/300th of the document corpus.
🗒️ Summary
Implements #91
During a processing run, DB writes are skipped for bundle/collection documents which have already been processed with an up-to-date version of the ancestry software (per a versioning tag, as used by repairkit).
All processing (read/compute/write) is skipped for non-aggregate product references belonging to a registry-refs page which has already been processed with an up-to-date version of the software.
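The skip decision described above can be sketched as a simple version comparison. This is an illustrative sketch only: `SWEEPERS_ANCESTRY_VERSION` and the metadata key name are assumptions for the example, not the actual registry-sweepers identifiers.

```python
# Hypothetical sketch of the version-tag skip check. The constant and the
# metadata key below are illustrative, not the real registry-sweepers names.

SWEEPERS_ANCESTRY_VERSION = 2  # assumed current ancestry software version


def is_up_to_date(doc: dict) -> bool:
    """True if the document was already processed by a current-or-newer
    version of the ancestry sweeper, per its versioning tag."""
    processed_version = doc.get("ops:Provenance/ops:ancestry_version", 0)
    return processed_version >= SWEEPERS_ANCESTRY_VERSION


def should_skip_write(doc: dict) -> bool:
    # For bundle/collection documents only the DB write is skipped;
    # for registry-refs pages the same check gates all processing.
    return is_up_to_date(doc)
```

A document harvested after the last sweep would lack the tag (or carry an older version), so it falls through to normal processing.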
DB writes are ordered such that it can be inferred that if an aggregate product has been tagged as up-to-date, all its descendants will also be up-to-date, assuming that re-harvesting a bundle or collection will overwrite its document, losing the existing ancestry version metadata. (@jordanpadams @tloubrieu-jpl @al-niessner is this a safe assumption, or do I need to check harvest's code?)
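The ordering invariant can be sketched as follows: every descendant update is emitted (and durably written) before its aggregate's tag, so the presence of the tag implies the descendants were written. This is a minimal sketch under that assumption; the function and field names are hypothetical.

```python
# Illustrative sketch (not the actual implementation) of write ordering:
# flush all non-aggregate descendant updates before tagging the aggregate,
# so an up-to-date aggregate tag implies up-to-date descendants.

def ordered_updates(aggregate_id: str, descendant_ids: list, version: int):
    """Yield updates with all descendants strictly before their ancestor."""
    for pid in descendant_ids:
        yield {"id": pid, "ancestry_version": version}
    # The aggregate is tagged last, so its tag can only be present if
    # every descendant update above was written first.
    yield {"id": aggregate_id, "ancestry_version": version}
```

If a run dies mid-stream, the aggregate is never tagged, so the next run correctly reprocesses it and its descendants.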
Once processing has completed, any products or registry-refs pages which do not indicate that they are up-to-date are counted and reported in an ERROR log, indicating that some products were either harvested during sweeper processing or are being missed and require a (much slower, yet-to-be-implemented) validation sweeper to process correctly.

Execution time for ancestry against sbnpsi is ~35min; with the optimisations it is <2sec on subsequent runs. For nodes like psa, which have a million aggregate products, a runtime that low should not be expected, but the change should still cut ancestry runtime to 0.1-1% of its previous duration.
The only caveat here is that progress is only made if execution completes: if ancestry repeatedly fails mid-execution due to resource issues, it won't make incremental progress and eventually succeed. This means that it's probably best for me to perform the first run against each node on a local machine with plenty of disk space, to avoid the need to allocate unnecessarily-large storage for ECS. This process will need to be repeated if/when the ancestry software version is incremented.
@jordanpadams @tloubrieu-jpl Further optimization to make incremental progress, avoiding this caveat, is possible and may be desirable; it just requires a less-naive approach to ordering/streaming the updates.
Please open/triage a ticket for this work if that seems warranted.
⚙️ Test Data and/or Report
Functional tests pass. New changes were tested manually; a final manual test is in progress.
♻️ Related Issues
Fixes #91