NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

39 - implement ancestry chunking memory optimization #93

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 5 months ago

🗒️ Summary

Implements a new chunked processing paradigm for ancestry sweeper to reduce provisioned memory requirement at the cost of some runtime speed and the need to provision enough disk space to perform swapping. The disk dump avoids the need to hold the full ancestry history in memory to generate AncestryRecords.

Performance tuning is available via env vars `ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE` (default 2000) and `ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD` (expressed in percent, default 80). Further details are available in ancestry.runtimeconstants.py
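For reference, a minimal sketch (assumed names and defaults mirroring the env vars above, not the actual module contents) of how such tunables are typically resolved from the environment:

```python
import os

# page size for non-aggregate product queries against the registry
ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE = int(
    os.environ.get("ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE", "2000")
)

# percentage of memory utilisation at which in-memory history is dumped to disk
ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD = float(
    os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80")
)
```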

Performance test on en-prod demonstrated the ability to drop from 1 vCPU and 4GB memory to 0.5 vCPU and 1GB memory (a cost savings of 42%), at the cost of a 31% increase in runtime. The proportional benefit is expected to be significantly higher for nodes with greater quantities of data, since higher memory allocations require more vCPUs, all but one of which are wasted, though this will require confirmation. Critically, this also removes the hard limit on scaling to large quantities of products that was previously imposed by the finite ability to increase provisioned RAM.

⚙️ Test Data and/or Report

Regression tests augmented/updated/passed. Functional tests implemented for new dump/merge behaviour to ensure production of correct update content. Manually tested against en-prod with a dry-run and shown to produce equal update count to pre-PR version. Performance-tested against en-prod.

~~N.B. Docs are failing due to a sphinx issue (locally and in GitHub Actions)~~ not anymore

♻️ Related Issues

fixes #39

Ops notes

I switched ECS task definitions to use the tag `stable` rather than `latest`, thinking that `latest` would automatically reference the last-pushed image (and we don't want a dev pushing a development-related tag and causing all tasks to use that by default), but this appears not to be the case. Is `latest` a manual tag rather than something auto-generated? If so, task definitions should be switched back to using `latest`. @sjoshi-jpl @jordanpadams ?

@sjoshi-jpl once this is deployed to ECR, I'd recommend the following approach to tune provisioned memory/vCPUs

If you're able to benchmark provisioned memory/vCPUs before/after and runtime before/after, awesome, but I understand that's annoying work.

alexdunnjpl commented 5 months ago

@nutjob4life firm agreement in principle.

I sort of wish I understood the problem a bit better.

If you're happy to sanity-check: the initial problem is that we need to take a large set of collection-level documents containing references to all non-aggregate products present in that collection, and create nonaggregate-level records containing each product's complete ancestry (every collection/bundle which references it).

This requires (naively) keeping the set of all non-agg records in memory from the start of the collection-level document iteration, to ensure that a non-agg's history is complete before its update is sent off to the db (since reading, appending, then updating would require an infeasible quantity of db calls).

So my solution was to say "okay, let's write the current page of non-agg history to disk when it gets too large, then we can merge the histories together later in a way which doesn't require holding it all in-memory." There is now additional motivation to use as much memory as possible: it's a bottleneck for execution time now, in addition to costing more to provision if memory use is unnecessarily peaky.
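As a rough illustration of the spill-to-disk idea (hypothetical helper names, not the sweeper's actual code, and assuming `psutil` is available for memory introspection):

```python
import pickle
import tempfile

import psutil  # assumed available for system memory introspection


def memory_usage_percent() -> float:
    """Current system memory utilisation, as a percentage."""
    return psutil.virtual_memory().percent


def dump_chunk_to_disk(chunk: dict) -> str:
    """Serialise the in-memory history chunk to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
        pickle.dump(chunk, f)
        return f.name


def accumulate_history(pages, threshold_percent: float = 80.0) -> list[str]:
    """Accumulate non-agg ancestry history, spilling chunks to disk when memory gets tight."""
    chunk: dict = {}
    chunk_files: list[str] = []
    for page in pages:  # each page maps non-agg lidvid -> partial ancestry references
        for lidvid, ancestors in page.items():
            chunk.setdefault(lidvid, set()).update(ancestors)
        if memory_usage_percent() > threshold_percent:
            chunk_files.append(dump_chunk_to_disk(chunk))
            chunk = {}  # start a fresh chunk; the merge step reconciles them later
    if chunk:
        chunk_files.append(dump_chunk_to_disk(chunk))
    return chunk_files
```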

The problem, then, is that a large proportion of the memory demand (let's say 100%, for simplicity's sake) is due to the need to take dict-like chunks of size S and then perform the following (see the sketch after the list):

  1. designate a chunk "active"
  2. for each other chunk, remove the values for any keys which are also present in the active chunk and merge them into the active chunk
  3. now that the active chunk is known to contain complete values for all of its keys, send it off to the db as part of a `_bulk` write
  4. rinse and repeat for the remaining chunks
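A rough sketch of that loop (illustrative only; the function and parameter names are assumptions, not the actual implementation):

```python
import pickle


def load_chunk(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


def save_chunk(path: str, chunk: dict) -> None:
    with open(path, "wb") as f:
        pickle.dump(chunk, f)


def merge_and_write(chunk_paths: list[str], write_bulk_updates) -> None:
    """Merge on-disk chunks into an active chunk, then flush it to the db."""
    for i, active_path in enumerate(chunk_paths):
        active = load_chunk(active_path)           # 1. designate a chunk "active"
        for other_path in chunk_paths[i + 1:]:
            other = load_chunk(other_path)
            # 2. for keys already in the active chunk, move the values across
            for key in other.keys() & active.keys():
                active[key].update(other.pop(key))
            save_chunk(other_path, other)          # persist the shrunken chunk
        # 3. the active chunk now holds complete histories for all of its keys
        write_bulk_updates(active)                 # e.g. one `_bulk` request to the db
        # 4. rinse and repeat for the remaining chunks
```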

Ideally, the amount of memory required should be ~2S, since at most two chunks are loaded simultaneously.

What ends up happening is that additional data is accumulated into the active chunk, resulting in ~(2S + merged data) memory use... my solution was to split chunks in half when they got larger than the largest non-merged chunk. Kinda brainless, but it works.
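Something like the following captures that split (an illustrative sketch, not the actual code):

```python
def split_in_half(chunk: dict) -> tuple[dict, dict]:
    """Split a dict-like chunk into two roughly equal halves by key count."""
    keys = list(chunk)
    midpoint = len(keys) // 2
    return {k: chunk[k] for k in keys[:midpoint]}, {k: chunk[k] for k in keys[midpoint:]}
```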

Finally, the manual garbage collection. At this point, the memory usage spikes to ~3S for a split second right as a new inactive chunk is loaded from disk for merging. I think what is happening is that the active and previous-inactive chunks are loaded (2S), then the third chunk starts loading before the GC releases the now-unneeded previous-inactive chunk. Manually calling `del` then `gc.collect()` successfully prevents this by triggering the release as a blocking call prior to loading the next chunk from disk.
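To make the placement concrete, here's an illustrative variant of the merge loop showing where the explicit release sits relative to the next disk read (hypothetical names, not the actual implementation):

```python
import gc
import pickle


def load_chunk(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


def merge_into_active(active: dict, inactive_paths: list[str]) -> dict:
    """Merge each on-disk chunk into the active chunk, releasing it before the next load."""
    for path in inactive_paths:
        inactive = load_chunk(path)
        for key in inactive.keys() & active.keys():
            active[key].update(inactive.pop(key))
        del inactive   # drop the only reference to the now-unneeded inactive chunk...
        gc.collect()   # ...and force the release before the next chunk is read from disk
    return active
```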

Possibly my understanding of what's going on is flawed. Almost certainly there's a better solution for this; I'm just not aware of one.