Open polyfractal opened 6 years ago
Pinging @elastic/es-core-infra
Pinging @elastic/es-search-aggs
It's not just multi-search: even a single sizable search request (i.e. via TransportSearchAction) against many shards can cause an OOM. This is easily reproducible on at least 5.6 through 7.4, and instantly so if logging to file/console is turned off.
While this issue involves circuit breakers, it is really about one particular use of circuit breakers, not the circuit breaker infrastructure itself. So I am removing the core/infra label and leaving it to the search team to implement/manage as they see fit.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
TransportMultiSearchAction has an AtomicArray that is used to collect individual search responses as they finish executing. When the last request finishes, the results in the array are packaged into a MultiSearchResponse and sent back to the client.
If the multi-search involves a large number of shards, or the responses are very large, or there are many multi-searches in parallel (or all three)... this seems like a prime candidate for causing an OOM on smaller heaps.
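To make the retention pattern concrete, here is a minimal sketch of "collect into an array, respond when the last one lands". It is not the actual TransportMultiSearchAction code: `ResponseCollector` is a hypothetical stand-in, and it uses the JDK's `AtomicReferenceArray` rather than Elasticsearch's `AtomicArray`. The point is only that every finished response stays reachable until the final one arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

// Hypothetical stand-in for the collection pattern described above; not Elasticsearch code.
final class ResponseCollector<T> {
    private final AtomicReferenceArray<T> slots;  // one slot per sub-request
    private final AtomicInteger pending;          // sub-requests still running
    private final Consumer<List<T>> onAllDone;    // invoked once, with every response

    ResponseCollector(int size, Consumer<List<T>> onAllDone) {
        this.slots = new AtomicReferenceArray<>(size);
        this.pending = new AtomicInteger(size);
        this.onAllDone = onAllDone;
    }

    // Called from any thread when sub-request `index` completes.
    void onResponse(int index, T response) {
        slots.set(index, response);  // held on the heap until the whole batch is done
        if (pending.decrementAndGet() == 0) {
            List<T> all = new ArrayList<>(slots.length());
            for (int i = 0; i < slots.length(); i++) {
                all.add(slots.get(i));
            }
            onAllDone.accept(all);
        }
    }
}
```

Nothing in this pattern bounds the memory held in `slots`, so peak usage grows with the number and size of the responses.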
I don't believe we track this array in any circuit breaker explicitly, and since it is holding finished results it isn't subject to any breakers in place for the search phase. If the responses come back to the coordinating node in a staggered manner it is unlikely to trip the in-flight breaker either.
It would be nice if we could account for this array in the Request breaker somehow. I imagine the tricky bit is estimating how big the various SearchResponse objects are (or Exception in the case of failures). @dakrone suggested maybe selecting n responses and averaging their size to use as a heuristic.
Tagging both Core/CB and Search because I'm indecisive...sorry! :)
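For discussion, a rough sketch of the sampling heuristic: measure the first n responses exactly, then charge the running average for every later one. Everything here is an assumption for illustration. `ByteBudget` is a toy stand-in for a breaker, not the real CircuitBreaker API, and `SampledAccounting` and the `exactSize` supplier are hypothetical names. How to actually measure a SearchResponse is left open.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical, simplified breaker: a byte budget that rejects reservations over the limit.
final class ByteBudget {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    ByteBudget(long limitBytes) { this.limitBytes = limitBytes; }

    void reserve(long bytes) {
        long now = usedBytes.addAndGet(bytes);
        if (now > limitBytes) {
            usedBytes.addAndGet(-bytes);  // roll back the failed reservation
            throw new IllegalStateException("would use " + now + " bytes, limit is " + limitBytes);
        }
    }

    void release(long bytes) { usedBytes.addAndGet(-bytes); }

    long used() { return usedBytes.get(); }
}

// Measures the first `sampleSize` responses exactly, then charges their average for the rest.
final class SampledAccounting {
    private final ByteBudget budget;
    private final int sampleSize;
    private long sampledBytes;
    private long sampledCount;
    private long reservedBytes;

    SampledAccounting(ByteBudget budget, int sampleSize) {
        if (sampleSize < 1) throw new IllegalArgumentException("sampleSize must be >= 1");
        this.budget = budget;
        this.sampleSize = sampleSize;
    }

    // `exactSize` is assumed to be expensive, so it is only invoked while sampling.
    synchronized void onResponse(LongSupplier exactSize) {
        long charge;
        if (sampledCount < sampleSize) {
            charge = exactSize.getAsLong();
            sampledBytes += charge;
            sampledCount++;
        } else {
            charge = sampledBytes / sampledCount;  // average of the sampled responses
        }
        budget.reserve(charge);  // throws if the budget would be exceeded
        reservedBytes += charge;
    }

    // Call once the combined response has been sent (or the request has failed).
    synchronized void close() {
        budget.release(reservedBytes);
        reservedBytes = 0;
    }
}
```

The obvious weakness is that the first n responses may not be representative (e.g. a few tiny shards answer first), so a real implementation might want to sample randomly or pad the estimate.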