NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Run sweepers locally against PSA prod #108

Closed alexdunnjpl closed 2 months ago

alexdunnjpl commented 4 months ago

💡 Description

If VPN proves too unstable to allow completion, just spin up an EC2 with 1TB block storage to handle this

blocks #98
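For illustration only, the EC2 fallback could be spun up with something like the boto3 sketch below; the AMI, instance type, region, and key pair are placeholders, not values from this issue.

```python
# Hypothetical sketch: launch an EC2 instance with a ~1 TB gp3 root volume via boto3.
# All identifiers here are placeholders, not the actual values used for this work.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder AMI
    InstanceType="r5.2xlarge",         # placeholder; size for the sweeper's RAM needs
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {"VolumeSize": 1000, "VolumeType": "gp3"},  # ~1 TB block storage
        }
    ],
)
print(response["Instances"][0]["InstanceId"])
```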

alexdunnjpl commented 4 months ago

status: ancestry currently running on bigdata

nutjob4life commented 4 months ago

Nearly complete (but fingers crossed)

tloubrieu-jpl commented 4 months ago

Still running on our on-prem server.

alexdunnjpl commented 4 months ago

After 124hrs, the sweeper has... finished iterating over the registry-refs pages. It's gonna take a minute.

alexdunnjpl commented 3 months ago

Status: unfortunately an unrelated spike in memory usage on the on-prem box resulted in disk dumps occurring a little too aggressively. Like, >1M 25kB files in a single flat directory aggressively.

I've set a static dump threshold of 1GB and restarted the job, as the dir won't even list with that many files.


We're paying for half a terabyte of memory, you bet I'm gonna use it all.
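For illustration, the static-dump-threshold idea looks roughly like the sketch below: buffer deferred records in memory and only spill to disk once the buffer reaches a fixed size, rather than reacting to system memory pressure with huge numbers of tiny files. This is not the actual registry-sweepers implementation; class and method names are illustrative.

```python
# Minimal sketch (not the actual registry-sweepers code) of a static dump threshold.
import json
from pathlib import Path

DUMP_THRESHOLD_BYTES = 1 * 1024**3  # 1 GB, per the comment above


class DeferredRecordBuffer:
    def __init__(self, dump_dir: Path):
        self.dump_dir = dump_dir
        self.records: list[dict] = []
        self.buffered_bytes = 0
        self.dump_count = 0

    def add(self, record: dict) -> None:
        # Track an approximate serialized size so dumps happen at a predictable threshold
        self.records.append(record)
        self.buffered_bytes += len(json.dumps(record))
        if self.buffered_bytes >= DUMP_THRESHOLD_BYTES:
            self.dump()

    def dump(self) -> None:
        # Write one large file per dump instead of one tiny file per record
        if not self.records:
            return
        path = self.dump_dir / f"deferred-{self.dump_count:06d}.json"
        with path.open("w") as f:
            json.dump(self.records, f)
        self.records.clear()
        self.buffered_bytes = 0
        self.dump_count += 1
```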

tloubrieu-jpl commented 3 months ago

Needs to be re-launched.

tloubrieu-jpl commented 3 months ago

We now need to launch from a different machine since the current one has been torn down.

alexdunnjpl commented 3 months ago

Awaiting availability of resources on mass-change-viz (currently running heavy ingestion - available in 24hrs)

tloubrieu-jpl commented 3 months ago

It has not been possible to process PDS with the sweeper on on-prem hosts. Now we want to:

  1. manage memory in a smarter way in the code base to avoid out-of-memory errors
  2. run on PSA step by step to avoid losing previously-completed work, by saving intermediate results to files and splitting steps by collection (see the sketch below)
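A rough sketch of item 2 above: process one collection at a time and persist each collection's intermediate result, so a restart can skip already-completed collections. Function names are illustrative, not the actual registry-sweepers API.

```python
# Hypothetical sketch: per-collection sweeping with on-disk checkpoints.
import json
from pathlib import Path

CHECKPOINT_DIR = Path("./sweeper-checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)


def checkpoint_path(collection_lid: str) -> Path:
    return CHECKPOINT_DIR / (collection_lid.replace(":", "_") + ".json")


def run_sweeper_by_collection(collection_lids: list[str]) -> None:
    for lid in collection_lids:
        path = checkpoint_path(lid)
        if path.exists():
            continue  # this collection was already completed in a prior run
        result = sweep_collection(lid)        # placeholder for the per-collection sweep
        path.write_text(json.dumps(result))   # save intermediate result to disk


def sweep_collection(collection_lid: str) -> dict:
    # placeholder: compute ancestry updates for a single collection's members
    return {"collection": collection_lid, "updates": []}
```
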
alexdunnjpl commented 3 months ago

Status: while investigating improvements, confusing behaviour was observed (shockingly fast execution). It turns out that PSA is undertaking an overhaul of their data and has deleted all(-ish) of their collections, so there is currently no representative high-volume registry.

Despite this, attempts to run in ECS failed. Currently testing on local hardware (where ~750GB of disk swap is available, unlike in ECS) to check what RAM/HDD requirements exist for the current psa registry.
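One simple way to measure the RAM a run actually needed (on Linux) is to log the process's peak RSS at the end of the sweep; this is just a measurement aid, not part of registry-sweepers itself.

```python
# Log the peak resident set size of the current process (Linux reports ru_maxrss in KB).
import resource


def log_peak_rss() -> None:
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS: {peak_kb / 1024 / 1024:.1f} GiB")
```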

tloubrieu-jpl commented 2 months ago

@alexdunnjpl is progressing on the optimization.

tloubrieu-jpl commented 2 months ago

A corner case came up; a ticket has been created for it.

alexdunnjpl commented 2 months ago

Status: still working on the corner-case implementation, but the bulk of the work is done and the remaining work is relatively straightforward.

tloubrieu-jpl commented 2 months ago

@alexdunnjpl is doing some manual tests, and development should be wrapped up by the end of the week; the PSA processing should run over the weekend.

alexdunnjpl commented 2 months ago

Currently running against psa-prod on local machine, results from smaller nodes are promising.

alexdunnjpl commented 2 months ago

psa-prod memory utilization looks to be ~12GB based on current peak (scaling predominantly with the peak number of non-aggregates in a single LID-family of collections, as expected).

alexdunnjpl commented 2 months ago

May've spoken a little soon - collection urn:esa:psa:em16_tgo_cas:data_raw has 333k pages of members in registry-refs...
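A quick way to spot such outlier collections up front would be to count registry-refs pages per collection with a terms aggregation, as a rough proxy for LID-family size. The sketch below uses opensearch-py; the index and field names ("registry-refs", "collection_lid") and the host are assumptions for illustration, not confirmed schema details.

```python
# Hypothetical sketch: rank collections by number of registry-refs pages (each doc is a page of member refs).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder host

resp = client.search(
    index="registry-refs",
    body={
        "size": 0,
        "aggs": {
            "pages_per_collection": {
                "terms": {"field": "collection_lid", "size": 10}  # assumed field name
            }
        },
    },
)
for bucket in resp["aggregations"]["pages_per_collection"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```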

alexdunnjpl commented 2 months ago

```
2024-04-14 00:48:37,176::pds.registrysweepers.utils.db::INFO::Updated documents for 37963771 products!
2024-04-14 00:48:37,178::pds.registrysweepers.ancestry::INFO::Generating updates from deferred records...
2024-04-14 00:48:37,178::pds.registrysweepers.ancestry::INFO::Writing deferred updates to database...
2024-04-14 00:48:37,178::pds.registrysweepers.utils.db::INFO::Writing document updates...
2024-04-14 00:48:41,732::pds.registrysweepers.utils.db::INFO::Updated documents for 35 products!
2024-04-14 00:48:41,732::pds.registrysweepers.ancestry::INFO::Checking indexes for orphaned documents
2024-04-14 00:48:41,904::pds.registrysweepers.ancestry::WARNING::Detected 1373 orphaned documents in index "registry - please inform developers": <run with debug logging enabled to view list of orphaned lidvids>
2024-04-14 00:48:41,964::pds.registrysweepers.ancestry::WARNING::Detected 927283 orphaned documents in index "registry-refs - please inform developers": <run with debug logging enabled to view list of orphaned lidvids>
2024-04-14 00:48:41,964::pds.registrysweepers.ancestry::INFO::Ancestry sweeper processing complete!
2024-04-14 00:48:41,967::__main__::INFO::Sweepers successfully executed in 50h13m8s
   pds.registrysweepers.ancestry: 50h13m8s
```

alexdunnjpl commented 2 months ago

The whole-dataset run required ~45GB of memory, though this will no longer be the case now that the corpus has been processed. Peak memory requirements may be significantly lower if a container fails (having completed a chunk of work) and re-runs, hitting the failing (i.e. most-demanding) collection first.

alexdunnjpl commented 2 months ago

@jordanpadams @tloubrieu-jpl given that we don't have rename-resolution functionality implemented in provenance, I'm almost certain we could implement a similar collection-iterative approach for that sweeper. This would remove the last portion of redundant reprocessing built into the sweeper and eliminate the vast majority of the remaining ECS cost associated with sweepers - 3.5hrs runtime per instance for psa-prod.

I don't know whether that's significant enough to justify the development effort, but this would re-use a lot of the investigatory labor from the ancestry rework and take much less time, maybe as little as a couple of days.
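For illustration, a collection-iterative provenance pass could look roughly like the sketch below, processing one collection LID-family at a time so each run's memory footprint is bounded by that family's size. Helper names are hypothetical; this is not the current provenance sweeper.

```python
# Hypothetical sketch: per-collection provenance (superseded-version) resolution.
from collections import defaultdict


def provenance_updates_for_collection(collection_lid: str) -> dict[str, str]:
    """Map each superseded product lidvid to the lidvid that supersedes it."""
    versions_by_lid: dict[str, list[str]] = defaultdict(list)
    for lidvid in iter_member_lidvids(collection_lid):  # placeholder fetch
        lid, _, _vid = lidvid.partition("::")
        versions_by_lid[lid].append(lidvid)

    updates: dict[str, str] = {}
    for lid, lidvids in versions_by_lid.items():
        # Sort versions numerically by vid (e.g. "2.0" -> [2, 0]) and point earlier
        # versions at the newest one.
        ordered = sorted(lidvids, key=lambda lv: [int(p) for p in lv.split("::")[1].split(".")])
        latest = ordered[-1]
        for superseded in ordered[:-1]:
            updates[superseded] = latest
    return updates


def iter_member_lidvids(collection_lid: str):
    # placeholder: would page through registry-refs for this collection
    return iter(())
```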

alexdunnjpl commented 2 months ago

Sweepers complete against psa-prod, 3h35m elapsed (all provenance).

@sjoshi-jpl I'll deploy the updated image to ECR now