Open hendrikmuhs opened 2 years ago
Pinging @elastic/es-analytics-geo (Team:Analytics)
Good summary of the situation. There's definitely a lot of tech debt around this area. Our current thinking is that the starting point for addressing this is https://github.com/elastic/elasticsearch/issues/77449, which seeks to address the huge volume of untracked data we send over the wire. We've seen issues like https://github.com/elastic/elasticsearch/issues/72619 which we directly attribute to that concern.
In general, I would rather we were overly pessimistic in the circuit breakers (as long as we don't "leak"). If a user gets an unnecessary circuit breaker exception, yes, it's frustrating, but the query can be attempted again. If we OOM, that's a huge problem.
There's an issue called, I think, "dense representation for aggs", which comes at this from another angle: all of the features we could build with more efficient returns. Those returns would need memory accounting, for sure, so this issue is related to that one.
This is a summary of the memory accounting shortcomings in the aggregation framework found while working on #83055. Related issues: #65019, #67476
The problem statement is complex and requires some explanation. The sub-problems are marked with P.{N} below.
Intro
Lifecycle of an aggregation
Every aggregation basically runs through multiple stages: the collection phase (mapping), executed on every shard; the post-collection phase, in which the `InternalAggregation` is built (`Aggregator::buildAggregations`); and the reduce phase, which is either partial (combine) or final (reduce).
Circuit breakers / memory accounting
To avoid running out of memory, Elasticsearch uses so-called circuit breakers. In a nutshell, a circuit breaker "measures" the "current", which is our memory consumption, and intercepts the flow (by throwing an exception) if it notices that consumption goes over budget. To make this work, every allocation must be registered and every free must be deducted. This all happens globally, not per request, and must be accurate in all situations, meaning it must survive every abnormal execution flow.
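The add/release discipline described above can be sketched with a toy class. This is an illustrative stand-in, not the real Elasticsearch `CircuitBreaker` API; all names here are hypothetical:

```java
// Toy sketch of circuit breaker accounting: every addition is registered,
// every free is deducted, and going over budget raises an exception
// instead of letting the node OOM. (Not the real Elasticsearch classes.)
final class ToyCircuitBreaker {
    private final long limitBytes;
    private long usedBytes; // shared, global-style counter

    ToyCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    void addEstimateBytes(long bytes) {
        if (usedBytes + bytes > limitBytes) {
            throw new IllegalStateException(
                "circuit breaker tripped: " + (usedBytes + bytes) + " > " + limitBytes);
        }
        usedBytes += bytes;
    }

    // Every free must be deducted, or the breaker "leaks" and trips early.
    void release(long bytes) {
        usedBytes -= bytes;
    }

    long used() {
        return usedBytes;
    }
}

public class CircuitBreakerDemo {
    public static void main(String[] args) {
        ToyCircuitBreaker breaker = new ToyCircuitBreaker(1000);
        breaker.addEstimateBytes(600);
        breaker.release(600);          // deducted ...
        breaker.addEstimateBytes(900); // ... otherwise this would trip
        boolean tripped = false;
        try {
            breaker.addEstimateBytes(200); // 900 + 200 > 1000 -> intercept
        } catch (IllegalStateException e) {
            tripped = true;
        }
        System.out.println("tripped=" + tripped + " used=" + breaker.used());
        // prints: tripped=true used=900
    }
}
```

Note that a tripped addition leaves the counter unchanged, so the request can fail cleanly while other requests keep their accurate accounting.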
Circuit breakers and aggregations
The aggregation framework provides access to the circuit breaker as part of the `AggregationContext` and, with some syntactic sugar, via `AggregatorBase::addRequestCircuitBreakerBytes`, which takes care of releasing the bytes in `close()`. In addition, the `AggregationContext` provides access to `BigArrays`, an abstraction for large arrays in memory. `BigArrays` uses the circuit breaker from the context.
Problem statements
Problem: AggregationContext only covers 1/3 of the lifecycle
The `AggregationContext` is only available in the collection phase. Although it is still available when the `InternalAggregations` are built, it gets released shortly afterwards. The `InternalAggregations` stay in memory or get serialized and sent over the wire. Because of that order, data structures must not use memory allocated from the `BigArrays` instance of the `AggregationContext`.
P.1 Building the InternalAggregation lacks memory accounting / BigArrays support
(Existing workarounds: Cardinality deep-copies the HLL++ structure (see `CardinalityAggregator`); `frequent_items` disconnects the agg context circuit breaker, which avoids the copy and with it the risk of going out of memory during the deep-copy.)

The `InternalAggregation` might be sent over the wire, meaning it gets serialized and deserialized. Deserialization happens before reduce, and it lacks access to memory accounting structures, e.g. access to a `BigArrays` instance.
P.2 Deserialization of InternalAggregation lacks access to CB/BigArrays
(Existing workaround: Cardinality (see `InternalCardinality`) uses `BigArrays.NON_RECYCLING_INSTANCE`. Note that `NON_RECYCLING` is a misleading name: the instance is not only non-recycling but also non-accounting.)

Instead of accounting for memory, the reduce phase works differently: it takes 1.5 times the serialized size of all internal aggregations and assumes that this budget is sufficient for the lifetime of the request; see `QueryPhaseResultConsumer::estimateRamBytesUsedForReduce`. Several problems arise from this educated guess:
P.3 Reduce must not use more than 1.5 times the sum of serialized sizes of all internal aggregations
Note: The serialized size is a bad indicator, because serialization uses compression, e.g. variable-length encodings. The size in memory can differ from the serialized size, as helper structures like hash tables require extra memory.
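The budget arithmetic above can be made concrete with a small sketch. The helper below mirrors the "educated guess" in spirit only; the real logic lives in `QueryPhaseResultConsumer::estimateRamBytesUsedForReduce` and this simplified stand-in is an assumption:

```java
// Back-of-the-envelope sketch of the reduce budget: 1.5x the summed
// serialized sizes of the shard responses, regardless of what the
// deserialized, in-memory representation actually needs.
public class ReduceBudgetDemo {
    // Hypothetical, simplified stand-in for the real estimate.
    static long estimateRamBytesUsedForReduce(long[] serializedSizes) {
        long sum = 0;
        for (long size : serializedSizes) {
            sum += size;
        }
        return (long) (1.5 * sum);
    }

    public static void main(String[] args) {
        // Three shard responses; their serialized (compressed,
        // varint-encoded) sizes say little about in-memory footprint.
        long[] serializedSizes = { 4_096, 8_192, 2_048 };
        long budget = estimateRamBytesUsedForReduce(serializedSizes);
        System.out.println("reduce budget = " + budget + " bytes"); // 21504
    }
}
```

An aggregation whose in-memory hash tables exceed 1.5x its compact wire format silently blows through this budget without the breaker noticing.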
P.4 BigArrays as part of AggregationReduceContext leads to double accounting of memory usage
`AggregationReduceContext` provides access to a `BigArrays` instance; however, if this instance is used, the memory is accounted for a second time. That's why it can only be used with care: otherwise memory is accounted for once via the serialized sizes as above and once more for the allocation itself. A solution might be a hook that frees the "educated guess" accounting and returns to proper memory management.

(Existing workaround: `InternalCardinality` does not use the `BigArrays` instance but uses `BigArrays.NON_RECYCLING_INSTANCE` in reduce.)

By nature, a reduce usually frees more memory than it allocates, because it merges shard results. Currently, however, this is not taken into account.
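The double-accounting effect can be illustrated with a toy counter (assumed names, not the real `CircuitBreaker`/`BigArrays` API): the up-front estimate is already reserved, so any allocation routed through an accounting allocator is charged a second time.

```java
// Sketch of the P.4 pitfall: the reduce phase reserves an up-front
// estimate, then an accounting BigArrays-style allocator charges the
// same working memory again. (Toy stand-ins, not Elasticsearch classes.)
public class DoubleAccountingDemo {
    static long accounted = 0; // toy global breaker counter

    static void reserve(long bytes) {
        accounted += bytes;
    }

    // Stand-in for an accounting array allocator backed by the breaker.
    static long[] newAccountedLongArray(int size) {
        reserve((long) size * Long.BYTES);
        return new long[size];
    }

    public static void main(String[] args) {
        long estimate = 16_384;                        // up-front reservation
        reserve(estimate);                             // charged once ...
        long[] scratch = newAccountedLongArray(2_048); // ... and again: +16 KiB
        // The same working memory is now accounted for twice:
        System.out.println("accounted = " + accounted + " bytes"); // 32768
    }
}
```

The hook suggested above would release the estimate-based reservation before (or as) the real, allocation-based accounting takes over, so the breaker sees each byte only once.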
P.5 The reduce phase blocks memory due to overestimation
The magic 1.5-times reservation stays for the whole lifetime of the request, meaning it blocks memory even though memory might have been freed in the meantime. Parallel incoming requests might then trigger the circuit breaker although there is no actual memory shortage. This problem gets more severe the longer the aggregation executes during final reduce; the `frequent_items` aggregation can take seconds and therefore blocks memory for longer.
Summary
The problems are complex. The most concerning are P.3-P.5, because they can lead to circuit breakers kicking in without need. For complex aggregations like `frequent_items`, where the main part of the logic happens on final reduce, we must ensure that the agg does not use more than 1.5 times the sum of serialized sizes (P.3), yet this budget is not knowable in code today. A first step might be to make this information accessible, e.g. as part of the reduce context.
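One possible shape for that first step, sketched as a toy interface. All names here are hypothetical and are not the actual `AggregationReduceContext` API; the point is only that a reduce implementation could size its working set against a known budget instead of guessing:

```java
// Hypothetical sketch: a reduce context that exposes the budget already
// reserved for the request, so an aggregation (e.g. frequent_items) can
// bound its working memory during final reduce.
public class BudgetAwareReduceContextDemo {
    // Assumed interface, not part of Elasticsearch.
    interface BudgetAwareReduceContext {
        long remainingReduceBudgetBytes();
    }

    static BudgetAwareReduceContext fixedBudget(long budgetBytes) {
        return () -> budgetBytes;
    }

    public static void main(String[] args) {
        BudgetAwareReduceContext ctx = fixedBudget(21_504);
        long perBucketCost = 512; // illustrative per-bucket memory cost
        // Size the working set to the known budget instead of a blind guess:
        long maxBuckets = ctx.remainingReduceBudgetBytes() / perBucketCost;
        System.out.println("maxBuckets = " + maxBuckets); // 42
    }
}
```

A follow-up step could let the implementation report actual usage back, allowing the estimate-based reservation of P.5 to shrink as memory is freed.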