Closed alexdunnjpl closed 2 months ago
status: ancestry currently running on bigdata
Nearly complete (but fingers crossed)
Still running on our on-prem server.
After 124hrs, the sweeper has... finished iterating over the registry-refs pages. It's gonna take a minute.
Status: unfortunately an unrelated spike in memory usage on the on-prem box resulted in disk dumps occurring a little too aggressively. Like, >1M 25kB files in a single flat directory aggressively.
I've set a static dump threshold of 1GB and restarted the job, as the dir won't even list with that many files.
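The size-based threshold amounts to flushing buffered updates once their accumulated serialized size crosses a byte limit, instead of dumping many tiny files. A minimal sketch of that idea (the `DumpBuffer` class and its method names are illustrative, not the sweeper's actual API):

```python
import json
import os

DUMP_THRESHOLD_BYTES = 1 * 1024**3  # static 1GiB threshold


class DumpBuffer:
    """Hypothetical buffer that coalesces records into few large dump files."""

    def __init__(self, dump_dir: str, threshold: int = DUMP_THRESHOLD_BYTES):
        self.dump_dir = dump_dir
        self.threshold = threshold
        self.records = []
        self.size = 0
        self.dump_count = 0
        os.makedirs(dump_dir, exist_ok=True)

    def add(self, record: dict):
        serialized = json.dumps(record)
        self.records.append(serialized)
        self.size += len(serialized)
        # Flush on accumulated size, not record count, so we never produce
        # a flood of tiny files in one flat directory
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if not self.records:
            return
        path = os.path.join(self.dump_dir, f"dump-{self.dump_count:05d}.jsonl")
        with open(path, "w") as f:
            f.write("\n".join(self.records))
        self.dump_count += 1
        self.records = []
        self.size = 0
```

With a 1GiB threshold, the same volume of data that previously produced >1M 25kB files lands in a handful of ~1GiB files, which the filesystem can actually list.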
We're paying for half a terabyte of memory, you bet I'm gonna use it all.
Needs to be re-launched.
We now need to launch from a different machine since the current one is torn down.
Awaiting availability of resources on mass-change-viz (currently running heavy ingestion - available in 24hrs)
It has not been possible to process PDS with the sweeper on on-prem hosts. Now we want to:
Status: while investigating improvements, confusing behaviour was observed (shockingly-fast execution). Turns out that psa is undertaking an overhaul of their data and has deleted all(-ish) of their collections, so there is currently no representative high-volume registry.
Despite this, attempts to run in ECS failed. Currently testing on local hardware (where ~750GB disk swap is available, unlike in ECS) to check what RAM/HDD requirements exist for the current psa registry.
@alexdunnjpl is progressing on the optimization.
A corner case came up; a ticket has been created for it.
Status: still working the corner-case implementation, but the bulk of the work is done and the remaining work is relatively straightforward
@alexdunnjpl is doing some manual tests. Development should be wrapped up by the end of the week, and the PSA processing should run over the weekend.
Currently running against psa-prod on local machine, results from smaller nodes are promising.
psa-prod memory utilization looks to be ~12GB based on current peak (scaling predominantly with the peak number of non-aggregates in a single LID-family of collections, as expected).
May've spoken a little soon - collection urn:esa:psa:em16_tgo_cas:data_raw has 333k pages of members in registry-refs
...
2024-04-14 00:48:37,176::pds.registrysweepers.utils.db::INFO::Updated documents for 37963771 products!
2024-04-14 00:48:37,178::pds.registrysweepers.ancestry::INFO::Generating updates from deferred records...
2024-04-14 00:48:37,178::pds.registrysweepers.ancestry::INFO::Writing deferred updates to database...
2024-04-14 00:48:37,178::pds.registrysweepers.utils.db::INFO::Writing document updates...
2024-04-14 00:48:41,732::pds.registrysweepers.utils.db::INFO::Updated documents for 35 products!
2024-04-14 00:48:41,732::pds.registrysweepers.ancestry::INFO::Checking indexes for orphaned documents
2024-04-14 00:48:41,904::pds.registrysweepers.ancestry::WARNING::Detected 1373 orphaned documents in index "registry - please inform developers": <run with debug logging enabled to view list of orphaned lidvids>
2024-04-14 00:48:41,964::pds.registrysweepers.ancestry::WARNING::Detected 927283 orphaned documents in index "registry-refs - please inform developers": <run with debug logging enabled to view list of orphaned lidvids>
2024-04-14 00:48:41,964::pds.registrysweepers.ancestry::INFO::Ancestry sweeper processing complete!
2024-04-14 00:48:41,967::__main__::INFO::Sweepers successfully executed in 50h13m8s
pds.registrysweepers.ancestry: 50h13m8s
The whole-dataset run required ~45GB memory, though that demand won't recur now that the corpus has been processed. Peak memory requirements may also be significantly lower if a container fails (having completed a chunk of work) and re-runs, since it hits the failing (i.e. most-demanding) collection first.
@jordanpadams @tloubrieu-jpl given that we don't have rename-resolution functionality implemented in provenance, I'm almost certain we could implement a similar collection-iterative approach for that sweeper. This would remove the last portion of redundant reprocessing built into the sweeper and eliminate the vast majority of the remaining ECS cost associated with sweepers (3.5hrs runtime per instance for psa-prod).
I don't know whether that's significant enough to justify the development effort, but this would re-use a lot of the investigatory labor from the ancestry rework and take much less time, maybe as little as a couple of days.
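The collection-iterative approach for provenance could look something like the sketch below: process one collection's members at a time, grouping lidvids by LID and linking each version to its successor, so only a single collection's membership needs to be held in memory. Function and field names here are illustrative assumptions, not the actual sweeper implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def provenance_updates_for_collection(member_lidvids: Iterable[str]) -> Dict[str, dict]:
    """Map each superseded lidvid in one collection to an update naming its successor."""
    versions_by_lid: Dict[str, List[str]] = defaultdict(list)
    for lidvid in member_lidvids:
        lid, _, _vid = lidvid.rpartition("::")
        versions_by_lid[lid].append(lidvid)

    updates: Dict[str, dict] = {}
    for lid, lidvids in versions_by_lid.items():
        # Order numerically by version tuple, so e.g. 1.10 sorts after 1.2
        ordered = sorted(
            lidvids,
            key=lambda lv: tuple(int(p) for p in lv.rpartition("::")[2].split(".")),
        )
        # Every version except the latest gets a pointer to its successor
        # (field name is a placeholder for whatever the sweeper writes)
        for predecessor, successor in zip(ordered, ordered[1:]):
            updates[predecessor] = {"ops:Provenance/ops:superseded_by": successor}
    return updates
```

Running this per collection, analogously to the ancestry rework, would avoid reprocessing the full corpus on every sweep; only collections whose membership changed would need revisiting.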
Sweepers complete against psa-prod, 3h35m elapsed (all provenance).
@sjoshi-jpl I'll deploy the updated image to ECR now
💡 Description
If the VPN proves too unstable to allow completion, just spin up an EC2 instance with 1TB block storage to handle this
blocks #98