Open hendrikmuhs opened 2 years ago
Pinging @elastic/es-analytics-geo (Team:Analytics)
Good summary of the situation. There's definitely a lot of tech debt around this area. Our current thinking is that the starting point for addressing this is https://github.com/elastic/elasticsearch/issues/77449, which seeks to address the huge volume of untracked data we send over the wire. We've seen issues like https://github.com/elastic/elasticsearch/issues/72619 which we directly attribute to that concern.
In general, I would rather we were overly pessimistic in the circuit breakers (as long as we don't "leak"). If a user gets an unnecessary circuit breaker exception, yes, it's frustrating, but the query can be attempted again. If we OOM, that's a huge problem.
There's an issue called, I think, "dense representation for aggs", which comes at this from another angle: all of the features we could build with more efficient returns. Those returns would need memory accounting, for sure, so this issue is related to that one.
This is a summary of the memory accounting shortcomings in the aggregation framework found while working on #83055. Related issues: #65019, #67476
The problem statement is complex and requires some explanation. The sub-problems are marked with P.{N} below.
Intro
Lifecycle of an aggregation
Every aggregation basically runs through multiple stages: the collection phase (mapping), executed on every shard; the post-collection phase, in which the `InternalAggregation` is built (`Aggregator::buildAggregations`); and the reduce phase, which is either partial (combine) or final (reduce).
Circuit breakers / memory accounting
To avoid running out of memory, Elasticsearch uses so-called circuit breakers. In a nutshell, a circuit breaker "measures" the "current", which is our memory consumption, and intercepts the flow (by throwing an exception) if it notices that consumption goes over budget. To make this work, every allocation must be registered and every free must be deducted. This all happens globally, not per request, and must be accurate in all situations, meaning it must survive every abnormal execution flow.
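The add/release discipline described above can be sketched with a toy class. This is an illustrative stand-in, not the real Elasticsearch `CircuitBreaker` API; all names here are hypothetical:

```java
// Toy sketch of circuit breaker accounting: every addition is registered,
// every free is deducted, and going over budget raises an exception
// instead of letting the node OOM. (Not the real Elasticsearch classes.)
final class ToyCircuitBreaker {
    private final long limitBytes;
    private long usedBytes; // shared, global-style counter

    ToyCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    void addEstimateBytes(long bytes) {
        if (usedBytes + bytes > limitBytes) {
            throw new IllegalStateException(
                "circuit breaker tripped: " + (usedBytes + bytes) + " > " + limitBytes);
        }
        usedBytes += bytes;
    }

    // Every free must be deducted, or the breaker "leaks" and trips early.
    void release(long bytes) {
        usedBytes -= bytes;
    }

    long used() {
        return usedBytes;
    }
}

public class CircuitBreakerDemo {
    public static void main(String[] args) {
        ToyCircuitBreaker breaker = new ToyCircuitBreaker(1000);
        breaker.addEstimateBytes(600);
        breaker.release(600);          // deducted ...
        breaker.addEstimateBytes(900); // ... otherwise this would trip
        boolean tripped = false;
        try {
            breaker.addEstimateBytes(200); // 900 + 200 > 1000 -> intercept
        } catch (IllegalStateException e) {
            tripped = true;
        }
        System.out.println("tripped=" + tripped + " used=" + breaker.used());
        // prints: tripped=true used=900
    }
}
```

Note that a tripped addition leaves the counter unchanged, so the request can fail cleanly while other requests keep their accurate accounting.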
Circuit breakers and aggregations
The aggregation framework provides access to the circuit breaker as part of the `AggregationContext` and, with some syntactic sugar, via `AggregatorBase::addRequestCircuitBreakerBytes`, which takes care of releasing the bytes in `close()`. In addition, the `AggregationContext` provides access to `BigArrays`, an abstraction for large arrays in memory. `BigArrays` uses the circuit breaker from the context.
Problem statements
Problem: AggregationContext only covers 1/3 of the lifecycle
The `AggregationContext` is only available in the collection phase. Although it is still available when the `InternalAggregations` are built, it gets released shortly afterwards. The `InternalAggregations` stay in memory or get serialized and sent over the wire. Because of that order, data structures must not use memory allocated from the `BigArrays` instance of the `AggregationContext`.
P.1 Building the InternalAggregation lacks memory accounting / BigArrays support
(Existing workarounds: Cardinality deep-copies the HLL++ structure (see `CardinalityAggregator`); `frequent_items` disconnects the agg context circuit breaker, which avoids the copy and with it the risk of going out of memory during the deep-copy.)

The `InternalAggregation` might be sent over the wire, meaning it gets serialized and deserialized. Deserialization happens before reduce, and it lacks access to memory accounting structures, e.g. access to a `BigArrays` instance.
P.2 Deserialization of InternalAggregation lacks access to CB/BigArrays
(Existing workaround: Cardinality (see `InternalCardinality`) uses `BigArrays.NON_RECYCLING_INSTANCE`. Note that `NON_RECYCLING` is a misleading name: the instance is not only non-recycling but also non-accounting.)

Instead of accounting for memory, the reduce phase works differently: it takes 1.5 times the serialized size of all internal aggregations and assumes that this budget is sufficient for the lifetime of the request; see `QueryPhaseResultConsumer::estimateRamBytesUsedForReduce`. Several problems arise from this educated guess:
P.3 Reduce must not use more than 1.5 times the sum of serialized sizes of all internal aggregations
Note: The serialized size is a bad indicator, because serialization uses compression, e.g. variable-length encodings. The size in memory can differ from the serialized size, as helper structures like hash tables require extra memory.
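The budget arithmetic above can be made concrete with a small sketch. The helper below mirrors the "educated guess" in spirit only; the real logic lives in `QueryPhaseResultConsumer::estimateRamBytesUsedForReduce` and this simplified stand-in is an assumption:

```java
// Back-of-the-envelope sketch of the reduce budget: 1.5x the summed
// serialized sizes of the shard responses, regardless of what the
// deserialized, in-memory representation actually needs.
public class ReduceBudgetDemo {
    // Hypothetical, simplified stand-in for the real estimate.
    static long estimateRamBytesUsedForReduce(long[] serializedSizes) {
        long sum = 0;
        for (long size : serializedSizes) {
            sum += size;
        }
        return (long) (1.5 * sum);
    }

    public static void main(String[] args) {
        // Three shard responses; their serialized (compressed,
        // varint-encoded) sizes say little about in-memory footprint.
        long[] serializedSizes = { 4_096, 8_192, 2_048 };
        long budget = estimateRamBytesUsedForReduce(serializedSizes);
        System.out.println("reduce budget = " + budget + " bytes"); // 21504
    }
}
```

An aggregation whose in-memory hash tables exceed 1.5x its compact wire format silently blows through this budget without the breaker noticing.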
P.4 BigArrays as part of AggregationReduceContext leads to double accounting of memory usage
`AggregationReduceContext` provides access to a `BigArrays` instance; however, if this instance is used, the memory is accounted for a second time. That's why it can only be used with care: otherwise memory is accounted for once via the serialized sizes as above and once more for the allocation itself. A solution might be a hook that frees the "educated guess" accounting and returns to proper memory management.

(Existing workaround: `InternalCardinality` does not use the `BigArrays` instance but uses `BigArrays.NON_RECYCLING_INSTANCE` in reduce.)

By nature, a reduce usually frees more memory than it allocates, because it merges shard results. Currently, however, this is not taken into account.
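The double-accounting effect can be illustrated with a toy counter (assumed names, not the real `CircuitBreaker`/`BigArrays` API): the up-front estimate is already reserved, so any allocation routed through an accounting allocator is charged a second time.

```java
// Sketch of the P.4 pitfall: the reduce phase reserves an up-front
// estimate, then an accounting BigArrays-style allocator charges the
// same working memory again. (Toy stand-ins, not Elasticsearch classes.)
public class DoubleAccountingDemo {
    static long accounted = 0; // toy global breaker counter

    static void reserve(long bytes) {
        accounted += bytes;
    }

    // Stand-in for an accounting array allocator backed by the breaker.
    static long[] newAccountedLongArray(int size) {
        reserve((long) size * Long.BYTES);
        return new long[size];
    }

    public static void main(String[] args) {
        long estimate = 16_384;                        // up-front reservation
        reserve(estimate);                             // charged once ...
        long[] scratch = newAccountedLongArray(2_048); // ... and again: +16 KiB
        // The same working memory is now accounted for twice:
        System.out.println("accounted = " + accounted + " bytes"); // 32768
    }
}
```

The hook suggested above would release the estimate-based reservation before (or as) the real, allocation-based accounting takes over, so the breaker sees each byte only once.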
P.5 The reduce phase blocks memory due to overestimation
The magic 1.5-times reservation stays for the whole lifetime of the request, meaning it blocks memory even though memory might have been freed in the meantime. Parallel incoming requests might then trigger the circuit breaker although there is no actual memory shortage. This problem gets more severe the longer the aggregation executes during final reduce; the `frequent_items` aggregation can take seconds and therefore blocks memory for longer.
Summary
The problems are complex. The most concerning are P.3-P.5, because they can lead to circuit breakers kicking in without need. For complex aggregations like `frequent_items`, where the main part of the logic happens on final reduce, we must ensure that the agg does not use more than 1.5 times the sum of serialized sizes (P.3), yet this budget is not knowable in code today. A first step might be to make this information accessible, e.g. as part of the reduce context.
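One possible shape for that first step, sketched as a toy interface. All names here are hypothetical and are not the actual `AggregationReduceContext` API; the point is only that a reduce implementation could size its working set against a known budget instead of guessing:

```java
// Hypothetical sketch: a reduce context that exposes the budget already
// reserved for the request, so an aggregation (e.g. frequent_items) can
// bound its working memory during final reduce.
public class BudgetAwareReduceContextDemo {
    // Assumed interface, not part of Elasticsearch.
    interface BudgetAwareReduceContext {
        long remainingReduceBudgetBytes();
    }

    static BudgetAwareReduceContext fixedBudget(long budgetBytes) {
        return () -> budgetBytes;
    }

    public static void main(String[] args) {
        BudgetAwareReduceContext ctx = fixedBudget(21_504);
        long perBucketCost = 512; // illustrative per-bucket memory cost
        // Size the working set to the known budget instead of a blind guess:
        long maxBuckets = ctx.remainingReduceBudgetBytes() / perBucketCost;
        System.out.println("maxBuckets = " + maxBuckets); // 42
    }
}
```

A follow-up step could let the implementation report actual usage back, allowing the estimate-based reservation of P.5 to shrink as memory is freed.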