NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Regularly synchronize the legacy registry collections on Solr into the registry in OpenSearch #58

Closed tloubrieu-jpl closed 9 months ago

tloubrieu-jpl commented 10 months ago

💡 Description

tloubrieu-jpl commented 10 months ago

@alexdunnjpl @al-niessner @sjoshi-jpl, I implemented the code to copy from the legacy registry on Solr into a new index, legacy_registry, on OpenSearch.

I would like that script to run only on the EN_PROD domain. Do you think it is OK to add an if statement in the sweepers_driver there https://github.com/NASA-PDS/registry-sweepers/blob/1a92b530c9e7a29d0b79e7afbfbc559dae4f3d0c/docker/sweepers_driver.py#L110 or do you have another idea?

Thanks

alexdunnjpl commented 10 months ago

@tloubrieu-jpl if this is a temporary/non-long-term task, I'd recommend running it separately (i.e. separate task and schedule) rather than as part of registry-sweepers. Is this even something that needs to be run periodically (and is it coded in such a way that it won't burn a bunch of compute time unnecessarily)?

If it should be part of sweepers, I'd suggest implementing an argparser option --enable-legacy-solr-import-sweeper or similar, defaulting to False, which is used as the condition for an if block running that sweeper, in the driver. That way @sjoshi-jpl can just add that option to the invoked docker command for the en-prod task definition.
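A minimal sketch of that opt-in flag, assuming the driver parses its arguments with argparse. Only the flag name comes from this thread; the helper names (`build_parser`, `select_sweepers`) and the sweeper labels are hypothetical placeholders, not the actual contents of sweepers_driver.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one opt-in flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Illustrative sweepers driver")
    # store_true makes the flag default to False when it is not passed
    parser.add_argument(
        "--enable-legacy-solr-import-sweeper",
        action="store_true",
        help="Also run the legacy Solr import sweeper (intended for en-prod only)",
    )
    return parser


def select_sweepers(argv: Optional[List[str]] = None) -> List[str]:
    """Return the labels of the sweepers that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    # Placeholder labels standing in for the sweepers the driver always runs
    sweepers = ["provenance", "ancestry"]
    # argparse converts the hyphens in the flag name to underscores
    if args.enable_legacy_solr_import_sweeper:
        sweepers.append("legacy_solr_import")
    return sweepers


if __name__ == "__main__":
    for name in select_sweepers():
        print(f"would run sweeper: {name}")
```

With no arguments only the default sweepers are selected, so existing task definitions would keep their current behaviour; the legacy import runs only where the flag is added to the command.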

tloubrieu-jpl commented 10 months ago

Thanks @alexdunnjpl, I guess it is a temporary task that might last for one or two years.

It should be run periodically, though maybe not as often as the other registry-sweeper tasks.

I will add the option as you suggested.

alexdunnjpl commented 10 months ago

@tloubrieu-jpl in this context, a couple of years is plenty to consider it non-temporary.

Given what you've said about it not running as often as other sweeper tasks, I'd suggest instead creating a second driver script (copy the existing one and give it a more-specific name) which is just for running your solr legacy script.

There is no way to decouple the cadences while using the same driver script.

You would need to double-check the Dockerfile to ensure that the script is copied into the image (probably have it copy all *.py in that directory, for future-proofing).

If we need extra configurability and/or prefer use of a single driver script, we could have all individual sweepers be opt-in via CLI flags, though existing task definitions for provenance/ancestry would need to be updated to use them.
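As a rough illustration of the single-driver alternative, the sketch below registers one opt-in flag per sweeper. Everything here is an assumption made for the example: the `SWEEPERS` registry, the `run_*` stubs, and the `--enable-<name>-sweeper` naming pattern are hypothetical and do not reflect the current driver.

```python
import argparse
from typing import Callable, Dict, List


# Stub sweepers; real ones would query and update the registry
def run_provenance() -> str:
    return "provenance done"


def run_ancestry() -> str:
    return "ancestry done"


def run_legacy_solr_import() -> str:
    return "legacy solr import done"


# Hypothetical registry mapping a flag-friendly name to its sweeper
SWEEPERS: Dict[str, Callable[[], str]] = {
    "provenance": run_provenance,
    "ancestry": run_ancestry,
    "legacy-solr-import": run_legacy_solr_import,
}


def parse_enabled(argv: List[str]) -> List[str]:
    """Return the names of the sweepers enabled on the command line."""
    parser = argparse.ArgumentParser(description="Illustrative opt-in driver")
    for name in SWEEPERS:
        # Every sweeper is off unless its flag is passed
        parser.add_argument(
            f"--enable-{name}-sweeper",
            dest=name.replace("-", "_"),
            action="store_true",
        )
    args = parser.parse_args(argv)
    return [name for name in SWEEPERS if getattr(args, name.replace("-", "_"))]


def run_enabled(argv: List[str]) -> List[str]:
    """Run the enabled sweepers in registry order and collect their results."""
    return [SWEEPERS[name]() for name in parse_enabled(argv)]
```

Under this scheme each scheduled task lists exactly the sweepers it wants, which is why the existing provenance/ancestry task definitions would have to start passing their flags explicitly.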

sjoshi-jpl commented 10 months ago

@tloubrieu-jpl here is the ECS Schedule override we spoke about. We can test it after you're done making changes.

Schedule: EN-PROD

Overrides:

{
  "containerOverrides": [{
    "name":"pds-en-prod-registry-sweeper-container",
    "command":["--enable-legacy-solr-import-sweeper"]
  }] 
}