enonic / xp

Enonic XP
https://enonic.com
GNU General Public License v3.0

Hit Elasticsearch Parent Circuit Breaker on customer installation #7290

Closed runarmyklebust closed 3 years ago

runarmyklebust commented 5 years ago

On the "fhi-xpqa" installation (v7.0.1), snapshots stopped working and the following exception is logged:

2019-08-12 09:14:17,017 ERROR s.r.enonic.datatoolbox.RcdScriptBean - Error while creating snapshot
org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be larger than limit of [719093760/685.7mb]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:247)
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:156)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:214)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:116)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:75)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Research what this parent circuit breaker (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/circuit-breaker.html#parent-circuit-breaker) is doing, why it triggers and what can be done.
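As a rough starting point, the limit in the exception message can be turned back into a heap size. This is a back-of-the-envelope sketch, not Elasticsearch code: it assumes indices.breaker.total.limit is still at the 70% of JVM heap default described in the linked 2.x documentation, and the class and method names are made up for illustration.

```java
/** Back-of-the-envelope arithmetic on the limit from the exception; not Elasticsearch code. */
final class ParentBreakerMath {

    // Assumption: the parent breaker is at its documented 2.x default of 70% of the JVM heap.
    static final long PARENT_LIMIT_PERCENT = 70;

    /** Heap size in bytes implied by a parent breaker limit, under the assumption above. */
    static long impliedHeapBytes(long parentLimitBytes) {
        return parentLimitBytes * 100 / PARENT_LIMIT_PERCENT;
    }

    static double toMebibytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long limit = 719_093_760L; // value taken from the exception message
        long heap = impliedHeapBytes(limit);
        System.out.printf("limit = %.2f MiB, implied heap = %.2f MiB%n",
                toMebibytes(limit), toMebibytes(heap));
    }
}
```

If the default applies, the limit corresponds to a heap of roughly 980 MiB, which would be consistent with a JVM started with about 1 GB of heap. If the percentage was overridden on this installation, the implied heap size changes accordingly.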

Some notes:

runarmyklebust commented 5 years ago

OK, I had a look and tried to figure out which message was causing this, but it disappeared from each node after a restart. I exposed the debug port on all cluster instances; let's debug when it happens again and try to pinpoint the exact event or message that is sent.

rymsha commented 5 years ago

The stack trace points to transport.inFlightRequestsBreaker().addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_request>"); which uses the in_flight_requests circuit breaker. That breaker is configured by network.breaker.inflight_requests.limit and network.breaker.inflight_requests.overhead.

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/circuit-breaker.html#in-flight-circuit-breaker

It indicates that a transport protocol message would not fit into the ES node's heap budget.
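To make the "[parent]" part of the message easier to reason about, here is a simplified model of a child breaker nested under a parent breaker. It is a sketch for illustration only, not the Elasticsearch implementation: all class and method names are hypothetical, and the parent simply tracks one running total. It assumes, per the linked 2.4 documentation, that the in-flight requests breaker defaults to a limit of 100% of heap with an overhead of 1, so the parent limit is the one that is reached first.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of a parent/child breaker pair; NOT Elasticsearch's implementation. */
final class BreakerModel {

    static final class BreakerTrippedException extends RuntimeException {
        BreakerTrippedException(String message) {
            super(message);
        }
    }

    /** Parent breaker: caps the combined usage of all child breakers. */
    static final class Parent {
        final long limitBytes;
        final AtomicLong totalUsed = new AtomicLong();

        Parent(long limitBytes) {
            this.limitBytes = limitBytes;
        }
    }

    /** Child breaker: has its own limit and also reserves against the parent. */
    static final class Child {
        final String name;
        final long limitBytes;
        final double overhead;
        final Parent parent;
        final AtomicLong used = new AtomicLong();

        Child(String name, long limitBytes, double overhead, Parent parent) {
            this.name = name;
            this.limitBytes = limitBytes;
            this.overhead = overhead;
            this.parent = parent;
        }

        /** Reserves bytes * overhead, or throws and rolls back if a limit would be exceeded. */
        void addEstimateAndMaybeBreak(long bytes, String label) {
            long estimate = (long) (bytes * overhead);
            if (used.addAndGet(estimate) > limitBytes) {
                used.addAndGet(-estimate);
                throw new BreakerTrippedException(
                        "[" + name + "] Data too large, data for [" + label + "]");
            }
            if (parent.totalUsed.addAndGet(estimate) > parent.limitBytes) {
                parent.totalUsed.addAndGet(-estimate);
                used.addAndGet(-estimate);
                throw new BreakerTrippedException(
                        "[parent] Data too large, data for [" + label + "]");
            }
        }

        /** Returns a previous reservation once the message has been handled. */
        void release(long bytes) {
            long estimate = (long) (bytes * overhead);
            used.addAndGet(-estimate);
            parent.totalUsed.addAndGet(-estimate);
        }
    }
}
```

In this model, a child whose own limit equals the whole heap never trips by itself; a large message is rejected only when the combined reservations cross the parent limit, which matches the "[parent]" prefix in the logged exception.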

Since it happened on the master node (where data is not transported to, I assume), I suspect an event or task payload was too big to fit. (An Event carries a Map, which can be of any size.)

It may be a good idea to add our own configurable circuit breaker, so that event and task payloads are always bounded.
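A minimal sketch of what such a guard could look like, assuming the payload is a Map with String keys. Everything here is hypothetical (EventPayloadGuard, its limit, and the size estimate are not existing XP APIs), and the estimate only sums the UTF-8 length of keys and of each value's string form, so it is a rough lower bound rather than the real serialized size.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical guard that rejects event/task payloads above a configurable size. */
final class EventPayloadGuard {

    private final long maxPayloadBytes;

    EventPayloadGuard(long maxPayloadBytes) {
        if (maxPayloadBytes <= 0) {
            throw new IllegalArgumentException("maxPayloadBytes must be positive");
        }
        this.maxPayloadBytes = maxPayloadBytes;
    }

    /** Rough estimate: UTF-8 bytes of every key plus the string form of every value. */
    static long estimateBytes(Map<String, ?> data) {
        long total = 0;
        for (Map.Entry<String, ?> entry : data.entrySet()) {
            total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += String.valueOf(entry.getValue()).getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Throws before the payload is handed to the transport layer if it is too large. */
    void check(Map<String, ?> data) {
        long estimate = estimateBytes(data);
        if (estimate > maxPayloadBytes) {
            throw new IllegalArgumentException(
                    "Payload estimate " + estimate + " bytes exceeds limit " + maxPayloadBytes);
        }
    }
}
```

Failing at the sender keeps the error close to the code that built the oversized payload, instead of surfacing later as a CircuitBreakingException on the receiving node.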

runarmyklebust commented 4 years ago

We should do something about this; to be discussed

rymsha commented 4 years ago

This may also be a consequence of the enormous max_result_window we set: https://github.com/enonic/xp/commit/00287ab169d2b930fa1b44a6a9c2e7c5fc484834

So, if the master node queries data (to do a dump, for instance), the data node fetches it and sends the full chunk to the master. If the master can't fit it into its heap, the circuit breaker trips to prevent an OutOfMemoryError.
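If that is the cause, one option is to cap how many hits are requested per round trip so that a single response stays within a byte budget. The helper below is only a sizing sketch: FetchSizing, the budget, and the average hit size are hypothetical inputs that would have to be measured on the installation.

```java
/** Hypothetical sizing helper: how many hits to request per page for a given byte budget. */
final class FetchSizing {

    /**
     * Returns the largest page size whose estimated response fits the budget,
     * never more than maxResultWindow and never less than 1.
     */
    static int safePageSize(long responseBudgetBytes, long avgHitBytes, int maxResultWindow) {
        if (responseBudgetBytes <= 0 || avgHitBytes <= 0 || maxResultWindow <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        long byBudget = responseBudgetBytes / avgHitBytes;
        return (int) Math.max(1, Math.min(byBudget, maxResultWindow));
    }
}
```

For example, with a 100 MiB budget and hits averaging 10 KiB, the helper suggests pages of 10240 hits regardless of how large max_result_window is.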

rymsha commented 4 years ago

Apparently, this may also happen when Elasticsearch is overloaded: https://discuss.elastic.co/t/should-circuitbreakingexception-cause-the-node-to-become-failed/220817

rymsha commented 3 years ago

@gbbirkisson @hjelmevold can you confirm whether this bug is still reproducible? If not, let's close this issue.

gbbirkisson commented 3 years ago

I have not seen this since HZ (Hazelcast) was introduced.