NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Non redundant provenance #101

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 5 months ago

Rebased on #100 - consider only commit 3f94d339edc53d38098b01117c291655bab90266

🗒️ Summary

Implements #92

Changes behaviour: the latest version of a product is now assigned "ops:Provenance/ops:superseded_by": null, rather than having the attribute left unassigned.

Implements software-version-based reprocessing avoidance, as already exists for repairkit and ancestry.
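As a rough illustration of what a software-version-based skip could look like, here is a minimal sketch. The metadata key `example:provenance_sweeper_version`, the constant `SWEEPER_VERSION` and the helper `is_up_to_date` are hypothetical names chosen for this example, not identifiers taken from this PR or from the repairkit/ancestry sweepers.

```python
# Hypothetical names for illustration only; the real key and version live in the sweeper code.
VERSION_KEY = "example:provenance_sweeper_version"
SWEEPER_VERSION = 2


def is_up_to_date(doc, current_version=SWEEPER_VERSION):
    """Return True if the document was already processed by this sweeper version or a later one.

    Documents that have never been processed carry no version stamp and are treated as version 0.
    """
    return doc.get(VERSION_KEY, 0) >= current_version
```

Under this assumption, bumping `SWEEPER_VERSION` whenever the sweeper logic changes would force every document to be reprocessed once, after which unchanged documents are skipped again.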

Reads all documents, builds a version chain for each distinct LID, and drops all singleton products (as they have no links). It then builds the links, tainting any product whose successor data has changed, and produces updates, skipping up-to-date records unless they have been tainted.
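The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: it assumes LIDVIDs of the form `lid::major.minor`, takes the documents as a plain dict, and takes the set of already-processed records as an input. The names `parse_lidvid`, `compute_updates` and `up_to_date` are hypothetical.

```python
from collections import defaultdict


def parse_lidvid(lidvid):
    """Split 'lid::major.minor' into (lid, (major, minor)) so versions sort numerically."""
    lid, vid = lidvid.split("::")
    major, minor = (int(part) for part in vid.split("."))
    return lid, (major, minor)


def compute_updates(docs, up_to_date):
    """Return {lidvid: new superseded_by value} for the records that need writing.

    docs:       {lidvid: currently stored superseded_by value, or None if null/unset}
    up_to_date: set of lidvids already processed by the current software version
                (hypothetical input; how this is determined is outside the sketch)
    """
    # Build a version chain for each distinct LID.
    chains = defaultdict(list)
    for lidvid in docs:
        lid, version = parse_lidvid(lidvid)
        chains[lid].append((version, lidvid))

    updates = {}
    for chain in chains.values():
        # Singleton products have no links, so they are dropped.
        if len(chain) < 2:
            continue
        ordered = [lidvid for _, lidvid in sorted(chain)]
        # Link each version to its successor; the latest version gets None (null).
        for lidvid, successor in zip(ordered, ordered[1:] + [None]):
            tainted = docs[lidvid] != successor  # stored successor data has changed
            if tainted or lidvid not in up_to_date:
                updates[lidvid] = successor
    return updates
```

In this sketch, a record that is already up to date is written again only when its stored successor no longer matches the chain, for example when a newer version of the product has been ingested since the last run.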

⚙️ Test Data and/or Report

Functional tests pass, but none are relevant to provenance, per #13. Manually tested by comparing the updates produced before and after the change.

♻️ Related Issues

fixes #92

alexdunnjpl commented 5 months ago

Benchmarking against sbnpsi shows a speed-up from 5m30s to 4m20s due to inherent speed improvements, but sbnpsi has only ~250 non-singleton products out of 1.5M total.

Results are likely to be considerably more impressive when a significant quantity of avoidable db writes is actually being skipped.