NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Profile memory usage #39

Closed alexdunnjpl closed 5 months ago

alexdunnjpl commented 11 months ago

Peak memory usage appears to coincide with the chunked db writes.

Two things spring to mind off the bat

alexdunnjpl commented 11 months ago

[memory profile image: usage over time, peaking around 2GB]

alexdunnjpl commented 11 months ago

Switching from dict comprehensions to generator expressions seems like low-hanging fruit with good potential for ROI.
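For illustration, a minimal sketch of the swap being considered (names here are hypothetical stand-ins, not the sweeper's actual code):

```python
# Hypothetical stand-ins for the sweeper's update-building code.
products = [f"urn:nasa:pds:example:product_{i}::1.0" for i in range(5)]

def make_update(lidvid: str) -> dict:
    return {"doc_id": lidvid, "body": {"example_field": True}}

# Dict comprehension: every update object exists in memory at the same time.
eager_updates = {lidvid: make_update(lidvid) for lidvid in products}

# Generator expression: updates are produced lazily, one at a time, as the
# chunked writer consumes them, so peak memory is bounded by one chunk.
lazy_updates = ((lidvid, make_update(lidvid)) for lidvid in products)
```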

alexdunnjpl commented 11 months ago

Looks like there may not be as many easy gains as I'd hoped.

Given the memory use suggested by the _bulk logs (a few hundred MB), the majority of the memory usage seems to come from the objects used to construct the updates, rather than the updates themselves.

Because of how ancestry works, it's not possible to chunk (i.e. stream) the processing, as data for all nonaggregate lidvids must be available while ancestry is being built.

So there's probably not much peak memory usage wasted, there.

@jordanpadams: suggest tabling this for the time being unless it's causing a problem (doubtful).

alexdunnjpl commented 11 months ago

Actually, there may be an opportunity to page data at the collection level - need to assess the reference dependencies to confirm.

alexdunnjpl commented 11 months ago

Should be able to roughly halve memory usage by streaming update objects to db writes rather than generating them all before writing.
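A rough sketch of what that streaming could look like (`generate_updates` and `write_chunk` are hypothetical stand-ins for the sweeper's update generation and chunked _bulk writes):

```python
from itertools import islice
from typing import Iterable, Iterator

def generate_updates() -> Iterator[dict]:
    # Hypothetical: yield update payloads one at a time instead of building a full list.
    for i in range(10_000):
        yield {"doc_id": f"urn:nasa:pds:example:product_{i}::1.0", "body": {"example_field": True}}

def write_chunk(chunk: list[dict]) -> None:
    # Hypothetical stand-in for a chunked _bulk write.
    print(f"writing {len(chunk)} updates")

def stream_writes(updates: Iterable[dict], chunk_size: int = 1000) -> None:
    it = iter(updates)
    while chunk := list(islice(it, chunk_size)):
        write_chunk(chunk)  # only one chunk's worth of update objects is ever resident

stream_writes(generate_updates())
```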

alexdunnjpl commented 6 months ago

Memory usage in ancestry is caused by accumulation of one AncestryRecord per non-agg product into a dict, during get_nonaggregate_ancestry_records().

Current approach (pseudocode):

state = {}  # lidvid -> AncestryRecord (initialisation elided)
for c in collections:
  for p in c.products:
    state[p].collections.append(c)
    state[p].bundles.extend(c.bundles)
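For context, a rough approximation of the record being accumulated per product (field names are inferred from the pseudocode in this thread, not the exact class definition):

```python
from dataclasses import dataclass, field

@dataclass
class AncestryRecord:
    lidvid: str
    parent_collection_lidvids: set[str] = field(default_factory=set)
    parent_bundle_lidvids: set[str] = field(default_factory=set)

# One of these objects, plus its two sets and every lidvid string they contain,
# stays resident for every non-aggregate product until the dict is fully built.
```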

Three candidate approaches (pseudocode):

O(products) queries, O(1) space

Too slow - at 100ms/call and 25M products that's 700hrs. Concurrency ain't gonna fix that.

for c in collections:
  for p in c.products:
    collections_with_product = query(<registry-refs collections where collection contains p>)

    bundle_ancestry = set()
    collection_ancestry = set()

    for cwp in collections_with_product:
      bundle_ancestry = bundle_ancestry.union(cwp.bundles)
      collection_ancestry.add(cwp)

    yield AncestryRecord(lidvid=p.lidvid, parent_bundle_lidvids=bundle_ancestry, parent_collection_lidvids=collection_ancestry)

O(collections) queries, O(1) space (actually O(overlapping_collections), but that is relatively static)

for c in collections:
  collections_with_overlap = query(<registry-refs collections where collection contains any p for p in c.products>)
  for p in c.products:
    collections_with_product = [cwo for cwo in collections_with_overlap if p in cwo.products]

    bundle_ancestry = set()
    collection_ancestry = set()

    for cwp in collections_with_product:
      bundle_ancestry = bundle_ancestry.union(cwp.bundles)
      collection_ancestry.add(cwp)

    yield AncestryRecord(lidvid=p.lidvid, parent_bundle_lidvids=bundle_ancestry, parent_collection_lidvids=collection_ancestry)

Faster, no-redundancy version of the previous: O(collections) queries, O(products) space, but the space scaling factor is "the size of a doc_id string" rather than "the size of an AncestryRecord object, every PdsProductIdentifier it contains, and every string/object they contain".

As above, but maintain a set to store already-processed non-agg product document ids so they aren't re-updated for every collection referencing them. Before every yield, add the document id to the set.
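A pseudocode sketch of that third approach, in the same style as the two above (the `doc_id` attribute is an assumed stand-in for the product's document id):

```python
# Pseudocode, mirroring approach 2 above; `processed` holds only doc-id strings.
processed = set()
for c in collections:
  collections_with_overlap = query(<registry-refs collections where collection contains any p for p in c.products>)
  for p in c.products:
    if p.doc_id in processed:
      continue  # already yielded for an earlier collection; skip the redundant update

    collections_with_product = [cwo for cwo in collections_with_overlap if p in cwo.products]

    bundle_ancestry = set()
    collection_ancestry = set()
    for cwp in collections_with_product:
      bundle_ancestry = bundle_ancestry.union(cwp.bundles)
      collection_ancestry.add(cwp)

    processed.add(p.doc_id)
    yield AncestryRecord(lidvid=p.lidvid, parent_bundle_lidvids=bundle_ancestry, parent_collection_lidvids=collection_ancestry)
```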

alexdunnjpl commented 6 months ago

Napkin math suggests ~2.5GB to store 25M 64-char strings. Probably run with approach 3 if the update conflict avoidance for redundant work is nontrivial, but avoiding that overhead and going with approach 2 might be necessary later if provenance can be pared down significantly.
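A quick sanity check of that figure (in CPython, an ASCII str carries roughly 49 bytes of object overhead, so a 64-char doc id is ~113 bytes before the set's own hash-table overhead):

```python
import sys

doc_id = "a" * 64                              # stand-in for a 64-char document id
per_string = sys.getsizeof(doc_id)             # ~113 bytes in CPython
total_gib = 25_000_000 * per_string / 1024**3  # ~2.6 GiB, excluding the set's own overhead
print(per_string, round(total_gib, 2))
```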

alexdunnjpl commented 6 months ago

Approach 2 has been implemented on the ancestry-memory-optimixation branch.

Initial benchmarks with db writes omitted show:

[en-prod benchmark image]

[psa-prod benchmark image]

The issue here is that calculating overlap requires a terms query for registry-refs docs matching any of the up-to-10k non-aggregate lidvids which are present in the collection reference page being processed. This query is taking ~40sec to execute for each page. psa-prod currently contains almost one million pages.
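For reference, a sketch of the shape of that overlap query; the field name is an assumption rather than the real registry-refs schema, and the 10k values come from one collection reference page. At ~40s per execution across ~1M pages, that is on the order of 40M seconds (~460 days) of query time, which is why this needs rethinking.

```python
page_of_nonaggregate_lidvids = [
    f"urn:nasa:pds:example:product_{i}::1.0" for i in range(10_000)
]

overlap_query = {
    "query": {
        "terms": {
            # assumed field name for the non-aggregate lidvids listed in registry-refs docs
            "product_lidvid": page_of_nonaggregate_lidvids
        }
    },
    "size": 10_000,
}
```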

It's possible that an orders-of-magnitude performance improvement may be obtained by tuning from "get overlap for a full page of collection references in a single query" toward "get overlap for a single product at a time". I'll investigate that this week.

~1OoM improvement may be gained from making requests/updates in parallel, although this relies upon the OpenSearch instance having sufficient resources available in the first place, and given our ability to choke the instances with repairkit writes, I have little confidence.

I/we need to think about whether it's viable to generate and update ancestry references statefully - i.e. identify which subset of non-aggregate products require updating in the first place. If it's sufficient to say that only products belonging to a collection reference page which is not up-to-date require ancestry generation, then the problem becomes tractable again (if we can make 2-3OoM performance improvements via tuning)

@jordanpadams @tloubrieu-jpl

alexdunnjpl commented 6 months ago

Idea to check - can the document corpus be chunked (into, say, ~1M doc chunks), each chunk processed, writing to disk if need be, and the processed chunks merged? If so, this would be a viable solution, avoiding the need for numerous expensive queries.
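A hedged sketch of what chunk-and-merge could look like (not necessarily the eventual implementation): each chunk's partial ancestry is spilled to a lidvid-sorted JSON-lines file, and the spill files are then merged lazily so only one record per file needs to be in memory at a time.

```python
import heapq
import json
from pathlib import Path
from typing import Iterator

def spill_chunk(chunk_index: int, partial: dict[str, dict], workdir: Path) -> Path:
    """Write one chunk's partial ancestry ({lidvid: {"collections": [...], "bundles": [...]}}) to disk, sorted by lidvid."""
    path = workdir / f"ancestry-chunk-{chunk_index}.jsonl"
    with path.open("w") as f:
        for lidvid in sorted(partial):
            f.write(json.dumps({"lidvid": lidvid, **partial[lidvid]}) + "\n")
    return path

def merged_records(paths: list[Path]) -> Iterator[dict]:
    """Lazily merge the sorted spill files, combining entries that share a lidvid."""
    streams = [(json.loads(line) for line in path.open()) for path in paths]
    current = None
    for record in heapq.merge(*streams, key=lambda r: r["lidvid"]):
        if current and current["lidvid"] == record["lidvid"]:
            current["collections"] += record["collections"]
            current["bundles"] += record["bundles"]
        else:
            if current:
                yield current
            current = record
    if current:
        yield current
```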

alexdunnjpl commented 6 months ago

Chunk-and-merge ancestry implementation is tentatively complete.

Better benchmarks will have to wait, but as an indication:

en-prod - 1.2M products
  0m40s 1.8GB baseline
  1m29s 3.2GB   2M/pg (1pg)
  1m35s 1.5GB 500k/pg (3pg)
  1m57s 0.9GB 250k/pg (5pg)
  2m40s 0.5GB 100k/pg (12pg)

Dry-run, no db writes. This greatly exaggerates the effect on runtime, as every 20-100k db ops may take as long as a full minute once OpenSearch indexing capacity is saturated.

Uses ~30MB for every page of 100k products, so something like 7.5GB for 25M products. Totally workable.

May want to incorporate some computation of optimal page size from nonagg count and/or available RAM? Probably should at least be available as an env var for manual per-node tweaking by ops.

Actually, if we introduce stateful processing (which is viable per conversation with @jordanpadams ), execution time should drop drastically, especially if a similar chunking approach can be applied to provenance, so reducing RAM should be the focus imho.

Might be a good idea to programmatically guess at page size values based on the query total_hits. Tuck that away for future consideration.
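One possible shape for that heuristic, combining the total_hits idea with the env-var override mentioned above; the variable name, RAM-budget parameter, and the ~30MB-per-100k-products figure (taken from the benchmark note above) are illustrative assumptions, not part of the implementation.

```python
import os

BYTES_PER_PRODUCT = 30 * 1024**2 / 100_000  # ~315 B/product, per the benchmark note above

def choose_page_size(total_hits: int, ram_budget_bytes: int) -> int:
    # Manual per-node override by ops takes precedence (hypothetical variable name).
    if override := os.environ.get("ANCESTRY_PAGE_SIZE"):
        return int(override)
    affordable = int(ram_budget_bytes / BYTES_PER_PRODUCT)
    return max(10_000, min(total_hits, affordable))
```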

alexdunnjpl commented 6 months ago

Looks like the overwhelming remaining proportion of memory is from maintaining the collection/bundle history.

Need to look at optimising storage of this metadata next. Consider replacing redundant string storage with integer IDs? Might slow things down too much, but you never know.
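A minimal sketch of the integer-ID idea, purely to illustrate the trade-off being weighed (one extra dict lookup per reference in exchange for storing each lidvid string only once):

```python
class LidvidInterner:
    """Store each distinct collection/bundle lidvid once; per-product history holds small ints."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self._strings: list[str] = []

    def intern(self, lidvid: str) -> int:
        if lidvid not in self._ids:
            self._ids[lidvid] = len(self._strings)
            self._strings.append(lidvid)
        return self._ids[lidvid]

    def lookup(self, lidvid_id: int) -> str:
        return self._strings[lidvid_id]
```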

alexdunnjpl commented 6 months ago

Should be able to retrieve collections, sorted by lidvid, then retrieve collection-refs, also sorted by collection lidvid, and process according to collection page.

Significantly more fiddly, but totally possible.

alexdunnjpl commented 6 months ago

~Status: it looks as though scrolling may be an inappropriate vehicle for pagination over long periods of time - need to implement search-after pagination to avoid what appears to be a scroll window timeout, possibly due to an opensearch configured scroll window maximum (not overridable by client)? If search-after implementation isn't trivial, it's worth confirming the cause.~

~This may be relevant, though - need to follow up first~

~https://github.com/elastic/elasticsearch/issues/65381~

Say it with me - "scrolling is no longer recommended for deep pagination" - i.e. >10000 hits

Fixed with implementation of search-after
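A minimal search-after pagination sketch, assuming an opensearch-py client and a sortable lidvid field (both assumptions); unlike a scroll, each request carries the sort values of the previous page's last hit, so there's no server-side context to expire.

```python
from typing import Iterator

from opensearchpy import OpenSearch  # assumed client; any client exposing .search() works

def iterate_hits(client: OpenSearch, index: str, query: dict, page_size: int = 10_000) -> Iterator[dict]:
    search_after = None
    while True:
        body = {"query": query, "size": page_size, "sort": [{"lidvid": "asc"}]}
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```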

alexdunnjpl commented 5 months ago

What are these steps? Possibly overlap between pages at different levels of paging, but it's not immediately obvious.

[memory profile image showing a stepped/sawtooth pattern]

Sawtooth size correlates with number of products in each disk-dump chunk

alexdunnjpl commented 5 months ago

Non-resetting memory use was due to accumulation of data into the "active" chunk. Fixed by implementing an automatic split of the active file as/when it exceeds the size of the largest pre-merge data chunk (this is a non-ideal approximation, but it appears to be working well enough).

Status: performance testing is in progress, but early signals indicate that we can successfully specify an upper memory bound in percent and have the application drive consumption no higher than that.
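One plausible way to enforce that kind of percent bound (psutil-based, purely illustrative of the idea rather than the actual mechanism):

```python
import psutil  # assumed dependency, for illustration only

def over_memory_budget(max_fraction: float = 0.4) -> bool:
    """True if this process's resident set exceeds the configured fraction of system RAM."""
    rss = psutil.Process().memory_info().rss
    return rss > max_fraction * psutil.virtual_memory().total

# e.g. checked before appending to the active chunk; if True, spill/split to disk.
```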

Still need to implement tests for some of the new functions, as they're nontrivial enough that this is non-negotiable.

tloubrieu-jpl commented 5 months ago

The development made by Alex requires some disk space. @sjoshi-jpl will see what parameters need to be changed in ECS to get more ephemeral storage than the default amount of 21GB (max configurable: 200GB). psa-prod needs 500GB.

tloubrieu-jpl commented 5 months ago

@alexdunnjpl is testing a new ECS deployment which works well with less memory. The pull request should be ready shortly for review.