NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Investigate/implement non-redundant provenance processing #92

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 6 months ago

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that ECS costs are significantly reduced

📖 Additional Details

No response

Acceptance Criteria

Provenance sweeper only processes data which is new, modified, or was processed with an out-of-date version of the ancestry sweeper, or which is tightly-coupled (definition TBD) to a document which has been modified

Given: a registry OpenSearch database When I do: add a new product to the OpenSearch database (with harvest) I expect: the sweeper processes only the new product

Given: a registry OpenSearch database When I do: manually update the harvest_time of a product in the OpenSearch database I expect: the sweeper processes only the updated product

Given: a registry OpenSearch database When I do: manually update the sweeper version of a product in the OpenSearch database I expect: the sweeper processes only the updated product
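The selection rule in these criteria could be sketched roughly as follows. This is only an illustration, not the sweeper's actual implementation: `ProductRecord`, its field names, `CURRENT_SWEEPER_VERSION` and `needs_processing` are all hypothetical, and the "tightly-coupled" case is left out because its definition is still TBD.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value; the real sweeper would define its own version constant.
CURRENT_SWEEPER_VERSION = 2


@dataclass
class ProductRecord:
    """Minimal stand-in for a registry document (field names are illustrative)."""
    lidvid: str
    harvest_time: str                      # ISO 8601 UTC timestamp written by harvest
    swept_at: Optional[str] = None         # when the sweeper last processed the document
    sweeper_version: Optional[int] = None  # sweeper version that last processed it


def needs_processing(doc: ProductRecord) -> bool:
    """Return True if the document is new, modified, or swept by an old version."""
    # New: never processed by the sweeper.
    if doc.swept_at is None or doc.sweeper_version is None:
        return True
    # Processed by an out-of-date sweeper version.
    if doc.sweeper_version < CURRENT_SWEEPER_VERSION:
        return True
    # Modified: harvested again after the last sweep. ISO 8601 strings in the
    # same format and timezone compare chronologically as plain strings.
    return doc.harvest_time > doc.swept_at
```

Each of the three Given/When/Then cases maps onto one branch: a newly harvested product has no sweep metadata, an edited harvest_time makes the document look modified, and an edited sweeper version makes it look out of date.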

⚙️ Engineering Details

No response

tloubrieu-jpl commented 5 months ago

good progress on this ticket.

alexdunnjpl commented 5 months ago

Wrong ticket, that'd be #91 - this'n hasn't been started yet.

@jordanpadams @tloubrieu-jpl can we make the assumption that versions of products will be inserted in chronological order? That is to say (assuming the sweeper never failed), if version V is in the registry at some point in time, all versions <V are guaranteed to also be in the registry?

I'm guessing we can't, but doesn't hurt to ask.

jordanpadams commented 5 months ago

@alexdunnjpl

if version V is in the registry at some point in time, all versions <V are guaranteed to also be in the registry?

No. There are data products produced by IMG that version based on the Ops pipeline, but only some of the Ops products are actually released. So they can have a product version 30.0, but only have 4 versions in the archive.

alexdunnjpl commented 5 months ago

@jordanpadams to be more specific, I mean "is it guaranteed that no version <V will be written into the registry at a later date?"

jordanpadams commented 5 months ago

@alexdunnjpl no. because we are creating these tools after numerous versions already exist for this data, nodes are often just loading the latest. eventually we will push on them to load past versions.

alexdunnjpl commented 5 months ago

@jordanpadams roger that, thanks!

This is fine, it just means that there's an additional candidate optimisation which isn't possible.

As it stands, local benchmarks indicate that provenance should take approximately 4min per 1M archived/certified products when not processing newly-harvested data. I imagine that in ECS it should be a little faster.

repairkit and ancestry should complete immediately when not processing newly-harvested data, so that's about as good a job as we can do.
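For a rough sense of scale, the benchmark figure can be extrapolated as below. This assumes the runtime scales linearly with product count, which the benchmark does not confirm; the function name is made up for illustration.

```python
# Local benchmark figure quoted above: about 4 minutes per 1M products.
MINUTES_PER_MILLION = 4.0


def estimated_provenance_minutes(product_count: int) -> float:
    """Linear extrapolation of the benchmark (an assumption, not a measurement)."""
    return product_count / 1_000_000 * MINUTES_PER_MILLION


# Example: estimate for a registry holding 2.5M archived/certified products.
estimate = estimated_provenance_minutes(2_500_000)
```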

tloubrieu-jpl commented 3 months ago

@gxtchen, this can be tested using the docker compose deployment of the full registry.

Start it like this:

docker compose --profile=int-registry-batch-loader up -d

It runs the sweeper once by default.

After you update the registry database, you can re-run the sweeper in a different terminal by:

  1. adding a "sweepers" profile to the sweeper service in the docker-compose.yaml file.
  2. running the command: docker compose --profile=sweepers up

tloubrieu-jpl commented 2 months ago

@gxtchen, to manually update a document in the registry database, you can use the OpenSearch update-document API: https://opensearch.org/docs/1.0/opensearch/rest-api/document-apis/update-document/#:~:text=If%20you%20need%20to%20update,runs%20to%20update%20the%20document.
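A minimal sketch of such an update request, using only the Python standard library. The `POST /<index>/_update/<id>` endpoint with a `{"doc": ...}` body is the documented update API; the helper name, base URL, index name, document id and field name below are placeholders to replace with the values from your deployment, which may also require HTTPS and credentials.

```python
import json
import urllib.request
from urllib.parse import quote


def build_update_request(base_url: str, index: str, doc_id: str, fields: dict) -> urllib.request.Request:
    """Build a partial-update request for one document (hypothetical helper)."""
    # Document ids such as LIDVIDs contain ':' characters, so percent-encode them.
    url = f"{base_url.rstrip('/')}/{index}/_update/{quote(doc_id, safe='')}"
    # The update API merges the fields under "doc" into the existing document.
    body = json.dumps({"doc": fields}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder values; adjust to the actual index, id and field being tested.
req = build_update_request(
    "http://localhost:9200",
    "registry",
    "urn:nasa:pds:example:data:product::1.0",
    {"harvest_time": "2023-06-01T00:00:00Z"},
)
# To send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching the database.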