NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

39 - implement ancestry chunking memory optimization #93

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 5 months ago

🗒️ Summary

Implements a new chunked processing paradigm for ancestry sweeper to reduce provisioned memory requirement at the cost of some runtime speed and the need to provision enough disk space to perform swapping. The disk dump avoids the need to hold the full ancestry history in memory to generate AncestryRecords.

Performance tuning is available via env vars `ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE` (default 2000) and `ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD` (expressed in percent, default 80). Further details are available in ancestry.runtimeconstants.py
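For reference, a minimal sketch (assumed names and defaults mirroring the env vars above, not the actual module contents) of how such tunables are typically resolved from the environment:

```python
import os

# page size for non-aggregate product queries against the registry
ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE = int(
    os.environ.get("ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE", "2000")
)

# percentage of memory utilisation at which in-memory history is dumped to disk
ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD = float(
    os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80")
)
```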

Performance test on en-prod demonstrated the ability to drop from 1 vCPU and 4GB memory to 0.5 vCPU and 1GB memory (a cost savings of 42%), at the cost of a 31% increase in runtime. The proportional benefit is expected to be significantly higher for nodes with greater quantities of data, since higher memory allocations require more vCPUs, all but one of which are wasted, though this will require confirmation. Critically, this also removes the hard limit on scaling to large quantities of products that was previously imposed by the finite ability to increase provisioned RAM.

⚙️ Test Data and/or Report

Regression tests augmented/updated/passed. Functional tests implemented for new dump/merge behaviour to ensure production of correct update content. Manually tested against en-prod with a dry-run and shown to produce equal update count to pre-PR version. Performance-tested against en-prod.

~~N.B. Docs are failing due to a sphinx issue (locally and in GitHub Actions)~~ not anymore

♻️ Related Issues

fixes #39

Ops notes

I switched ECS task definitions to use the tag `stable` rather than `latest`, thinking that `latest` would automatically reference the last-pushed image (and we don't want a dev pushing a development-related tag and causing all tasks to use that by default), but this appears not to be the case. Is `latest` a manual tag rather than something auto-generated? If so, task definitions should be switched back to using `latest`. @sjoshi-jpl @jordanpadams ?

@sjoshi-jpl once this is deployed to ECR, I'd recommend the following approach to tune provisioned memory/vCPUs

If you're able to benchmark provisioned memory/vCPUs before/after and runtime before/after, awesome, but I understand that's annoying work.

alexdunnjpl commented 5 months ago

@nutjob4life firm agreement in principle.

I sort of wish I understood the problem a bit better.

If you're happy to sanity-check: the initial problem is that we need to take a large set of collection-level documents containing references to all non-aggregate products present in that collection, and create nonaggregate-level records containing each product's complete ancestry (every collection/bundle which references it).

This requires (naively) keeping the set of all non-agg records in memory from the start of the collection-level document iteration, to ensure that a non-agg's history is complete before its update is sent off to the db (since reading, appending, then updating would require an infeasible quantity of db calls).

So my solution was to say "okay, let's write the current page of non-agg history to disk when it gets too large, then we can merge the histories together later in a way which doesn't require holding it all in-memory." There is now additional motivation to use as much memory as possible: it's a bottleneck for execution time now, in addition to costing more to provision if memory use is unnecessarily peaky.
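As a rough illustration of the spill-to-disk idea (hypothetical helper names, not the sweeper's actual code, and assuming `psutil` is available for memory introspection):

```python
import pickle
import tempfile

import psutil  # assumed available for system memory introspection


def memory_usage_percent() -> float:
    """Current system memory utilisation, as a percentage."""
    return psutil.virtual_memory().percent


def dump_chunk_to_disk(chunk: dict) -> str:
    """Serialise the in-memory history chunk to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
        pickle.dump(chunk, f)
        return f.name


def accumulate_history(pages, threshold_percent: float = 80.0) -> list[str]:
    """Accumulate non-agg ancestry history, spilling chunks to disk when memory gets tight."""
    chunk: dict = {}
    chunk_files: list[str] = []
    for page in pages:  # each page maps non-agg lidvid -> partial ancestry references
        for lidvid, ancestors in page.items():
            chunk.setdefault(lidvid, set()).update(ancestors)
        if memory_usage_percent() > threshold_percent:
            chunk_files.append(dump_chunk_to_disk(chunk))
            chunk = {}  # start a fresh chunk; the merge step reconciles them later
    if chunk:
        chunk_files.append(dump_chunk_to_disk(chunk))
    return chunk_files
```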

The problem, then, is that a large proportion of the memory demand (let's say 100%, for simplicity's sake) is due to the need to take dict-like chunks of size S and then perform the following (see the sketch after the list):

  1. designate a chunk "active"
  2. for each other chunk, remove the values for any keys which are also present in the active chunk and merge them into the active chunk
  3. now that the active chunk is known to contain complete values for all of its keys, send it off to the db as part of a `_bulk` write
  4. rinse and repeat for the remaining chunks
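A rough sketch of that loop (illustrative only; the function and parameter names are assumptions, not the actual implementation):

```python
import pickle


def load_chunk(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


def save_chunk(path: str, chunk: dict) -> None:
    with open(path, "wb") as f:
        pickle.dump(chunk, f)


def merge_and_write(chunk_paths: list[str], write_bulk_updates) -> None:
    """Merge on-disk chunks into an active chunk, then flush it to the db."""
    for i, active_path in enumerate(chunk_paths):
        active = load_chunk(active_path)           # 1. designate a chunk "active"
        for other_path in chunk_paths[i + 1:]:
            other = load_chunk(other_path)
            # 2. for keys already in the active chunk, move the values across
            for key in other.keys() & active.keys():
                active[key].update(other.pop(key))
            save_chunk(other_path, other)          # persist the shrunken chunk
        # 3. the active chunk now holds complete histories for all of its keys
        write_bulk_updates(active)                 # e.g. one `_bulk` request to the db
        # 4. rinse and repeat for the remaining chunks
```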

Ideally, the amount of memory required should be ~2S, since at most two chunks are loaded simultaneously.

What ends up happening is that additional data is accumulated into the active chunk, resulting in ~(2S + merged data) memory use... my solution was to split chunks in half when they got larger than the largest non-merged chunk. Kinda brainless, but it works.
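Something like the following captures that split (an illustrative sketch, not the actual code):

```python
def split_in_half(chunk: dict) -> tuple[dict, dict]:
    """Split a dict-like chunk into two roughly equal halves by key count."""
    keys = list(chunk)
    midpoint = len(keys) // 2
    return {k: chunk[k] for k in keys[:midpoint]}, {k: chunk[k] for k in keys[midpoint:]}
```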

Finally, the manual garbage collection. At this point, the memory usage spikes to ~3S for a split second right as a new inactive chunk is loaded from disk for merging. I think what is happening is that the active and previous-inactive chunks are loaded (2S), then the third chunk starts loading before the GC releases the now-unneeded previous-inactive chunk. Manually calling `del` then `gc.collect()` successfully prevents this by triggering the release as a blocking call prior to loading the next chunk from disk.
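To make the placement concrete, here's an illustrative variant of the merge loop showing where the explicit release sits relative to the next disk read (hypothetical names, not the actual implementation):

```python
import gc
import pickle


def load_chunk(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


def merge_into_active(active: dict, inactive_paths: list[str]) -> dict:
    """Merge each on-disk chunk into the active chunk, releasing it before the next load."""
    for path in inactive_paths:
        inactive = load_chunk(path)
        for key in inactive.keys() & active.keys():
            active[key].update(inactive.pop(key))
        del inactive   # drop the only reference to the now-unneeded inactive chunk...
        gc.collect()   # ...and force the release before the next chunk is read from disk
    return active
```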

Possibly my understanding of what's going on is flawed. Almost certainly there's a better solution for this; I'm just not aware of one.