NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Investigate/implement non-redundant ancestry processing #91

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 6 months ago

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that ECS costs are significantly reduced

📖 Additional Details

No response

Acceptance Criteria

Ancestry sweeper only processes data which is new, modified, or was processed with an out-of-date version of the ancestry sweeper, or which references a bundle or collection which has been modified.
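The criteria above amount to a four-way predicate over each record. A minimal sketch of that check, assuming a hypothetical `ProductRecord` shape (the field names, the version constant, and the LIDVID strings are illustrative and not taken from the registry schema or the sweeper code):

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical version number of the ancestry sweeper currently deployed.
CURRENT_SWEEPER_VERSION = 3


@dataclass
class ProductRecord:
    """Illustrative stand-in for a registry document; not the real schema."""
    lidvid: str
    # Sweeper version that last processed this record; None means never processed.
    processed_version: Optional[int] = None
    # True if the record changed after it was last processed.
    modified_since_processing: bool = False
    # LIDVIDs of the bundles/collections this record references.
    referenced_aggregates: Set[str] = field(default_factory=set)


def needs_processing(
    record: ProductRecord,
    modified_aggregates: Set[str],
    current_version: int = CURRENT_SWEEPER_VERSION,
) -> bool:
    """Return True if the record falls under any of the acceptance criteria."""
    if record.processed_version is None:  # new
        return True
    if record.modified_since_processing:  # modified
        return True
    if record.processed_version < current_version:  # out-of-date sweeper version
        return True
    # references a bundle or collection which has itself been modified
    return bool(record.referenced_aggregates & modified_aggregates)
```

Anything for which `needs_processing` returns `False` could be skipped on a given run, which is where the cost reduction would come from.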

⚙️ Engineering Details

No response

alexdunnjpl commented 5 months ago

Key question for whether this will work - are registry-refs docs guaranteed to be written to OpenSearch after the non-aggregate products they refer to?

@jordanpadams @al-niessner do you know, off the top of your head?

al-niessner commented 5 months ago

@alexdunnjpl

I was just looking at a related item in harvest a week or two ago. While I cannot say for certain because I have not tested it, harvest processes bundles then collections then products (non-aggs) and writes them in that order via batching and a List. So, just the opposite of the order you want.

I think the primary reason it is in this order is that it simplifies the testing and checking that harvest has to do. Since the bundle is already loaded, it knows if the collection is part of it. Ditto on the next layer down. It makes the harvest code much simpler.

The order in which it is written is not as important. However, writes can be batched or done as found. Default is batch. If done as found, then the order is obvious. I remember the batch using a list and sending to the registry from first to last index. It would probably be easy, but no promises, to send the array in reverse. However, this would not help you if the user is not using batch mode.
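To illustrate the "do the array in reverse" idea, here is a rough Python sketch. It is not harvest code (harvest is a separate tool and nothing here reflects its actual classes); the `Record` tuple, the `write_order` helper, and the sample identifiers are all hypothetical:

```python
from typing import List, Tuple

# Hypothetical (kind, identifier) pair standing in for a queued document.
Record = Tuple[str, str]


def write_order(batch: List[Record], reverse: bool = True) -> List[Record]:
    """Return the order in which a batch would be sent.

    The batch is assumed to be filled in discovery order
    (bundle, then collections, then products). Reversing it sends
    products first and the aggregates that refer to them last.
    """
    return batch[::-1] if reverse else list(batch)


# Batch as discovered: bundle -> collection -> products.
batch = [
    ("bundle", "b1"),
    ("collection", "c1"),
    ("product", "p1"),
    ("product", "p2"),
]

for kind, ident in write_order(batch):
    print(kind, ident)
```

This only changes the order within a single batch; it says nothing about ordering across batches or about the as-found mode.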

alexdunnjpl commented 5 months ago

Thanks Al, much appreciated!

Will need to have a think about whether to follow this up in harvest or rely on detection/cleanup of such cases. Even if we did fix it, all it would take to break something is for someone to use an out-of-date harvest, so it seems like maybe the latter is the only option.

jordanpadams commented 5 months ago

@alexdunnjpl @al-niessner one catch here is that this probably only applies when someone actually points harvest at a bundle. Harvest can be pointed at any directory.

alexdunnjpl commented 5 months ago

Suggest (accepted, per breakout): implement naively, ignoring the "ingestion while sweeping" test case, and monitor the quantity of orphaned documents, or just check them in a few weeks/months. If there is an unmanageable quantity of orphans, we'll need to rethink; otherwise, implement a secondary cleanup sweeper process.
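A rough sketch of that monitoring decision, assuming an "orphan" is a document that carries no ancestry metadata after a sweep. The field name `ancestry_version`, the threshold, and the in-memory document list are all hypothetical; a real check would query the registry instead:

```python
from typing import Dict, Iterable


def count_orphans(docs: Iterable[Dict], version_field: str = "ancestry_version") -> int:
    """Count documents with no ancestry metadata (hypothetical orphan definition)."""
    return sum(1 for doc in docs if version_field not in doc)


def next_step(orphan_count: int, manageable_limit: int) -> str:
    """Map the orphan count onto the two outcomes discussed above."""
    if orphan_count > manageable_limit:
        return "rethink approach"
    return "implement secondary cleanup sweeper"


# Illustrative documents only; identifiers and fields are made up.
docs = [
    {"lidvid": "urn:nasa:pds:x::1.0", "ancestry_version": 3},
    {"lidvid": "urn:nasa:pds:y::1.0"},  # missed by the sweeper
    {"lidvid": "urn:nasa:pds:z::1.0", "ancestry_version": 3},
]

orphans = count_orphans(docs)
print(orphans, next_step(orphans, manageable_limit=100))
```

What counts as "unmanageable" is left as a parameter here, since the thread does not fix a number.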