Closed runarmyklebust closed 3 years ago
OK, I had a look and tried to figure out which message was causing this, but it had disappeared from every node after the restart. I have exposed the debug port on all cluster instances; let's debug when it happens again to pinpoint the exact event or message being sent.
The stack trace points to transport.inFlightRequestsBreaker().addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_request>"); which uses the in_flight_requests CircuitBreakers. They are configured by network.breaker.inflight_requests.limit and network.breaker.inflight_requests.overhead.
It indicates that a transport protocol message does not fit into the ES node's heap.
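To make the limit/overhead interplay concrete, here is a simplified model of a memory circuit breaker. It is a sketch only, not the Elasticsearch implementation: the class name, `reserveOrBreak`, `release` and the exception type are made up for illustration, and it assumes the breaker multiplies the running byte estimate by the overhead factor before comparing it to the limit.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of a memory circuit breaker; NOT the Elasticsearch code. */
public class InFlightBreakerSketch {

    /** Hypothetical exception type, thrown when a reservation would exceed the limit. */
    static class BreakerTrippedException extends RuntimeException {
        BreakerTrippedException(String message) {
            super(message);
        }
    }

    private final long limitBytes;  // plays the role of ...inflight_requests.limit, resolved to bytes
    private final double overhead;  // plays the role of ...inflight_requests.overhead
    private final AtomicLong usedBytes = new AtomicLong();

    InFlightBreakerSketch(long limitBytes, double overhead) {
        this.limitBytes = limitBytes;
        this.overhead = overhead;
    }

    /** Reserve bytes for an incoming message, or throw if the estimate exceeds the limit. */
    void reserveOrBreak(long bytes, String label) {
        long newUsed = usedBytes.addAndGet(bytes);
        if (newUsed * overhead > limitBytes) {
            usedBytes.addAndGet(-bytes); // roll back the failed reservation
            throw new BreakerTrippedException("[" + label + "] data too large: "
                    + newUsed + " bytes in flight, limit is " + limitBytes);
        }
    }

    /** Give bytes back once the message has been handled. */
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }

    public static void main(String[] args) {
        InFlightBreakerSketch breaker = new InFlightBreakerSketch(1000, 1.0);
        breaker.reserveOrBreak(600, "<transport_request>");
        try {
            breaker.reserveOrBreak(500, "<transport_request>"); // 1100 > 1000, trips
        } catch (BreakerTrippedException e) {
            System.out.println("tripped: " + e.getMessage());
        }
        breaker.release(600);
        breaker.reserveOrBreak(500, "<transport_request>"); // fits once the first message is released
        System.out.println("in flight: " + breaker.used());
    }
}
```

In this model a single oversized message and many medium-sized concurrent messages trip the breaker in the same way, which is why the stack trace alone does not tell us which of the two happened.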
Since it happened on the master node (to which, I assume, no data is transported), I suspect an event or task payload was too big to fit. (An event carries a Map, which can be any size.)
It may be a good idea to make our own configurable circuit breaker, so that events/tasks are never unbounded.
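A rough sketch of what such a guard could look like, to seed the discussion. Everything here is hypothetical: `EventPayloadGuard`, `PayloadTooLargeException` and the size estimate (UTF-8 length of keys plus the string form of values) are illustrative assumptions, not existing XP code, and a real implementation would likely measure the serialized form instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical guard that rejects event payloads above a configurable size. */
public class EventPayloadGuard {

    /** Hypothetical exception type for oversized payloads. */
    static class PayloadTooLargeException extends RuntimeException {
        PayloadTooLargeException(String message) {
            super(message);
        }
    }

    private final long maxBytes; // would come from configuration

    EventPayloadGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Rough estimate: UTF-8 length of every key plus the string form of every value. */
    static long estimateBytes(Map<String, ?> data) {
        long total = 0;
        for (Map.Entry<String, ?> entry : data.entrySet()) {
            total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += String.valueOf(entry.getValue()).getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Throws before the event is published if its payload estimate exceeds the limit. */
    void check(String eventType, Map<String, ?> data) {
        long estimate = estimateBytes(data);
        if (estimate > maxBytes) {
            throw new PayloadTooLargeException("Event [" + eventType + "] payload is about "
                    + estimate + " bytes, limit is " + maxBytes);
        }
    }

    public static void main(String[] args) {
        EventPayloadGuard guard = new EventPayloadGuard(16);
        guard.check("small.event", Map.of("id", "42")); // 4 bytes, passes
        try {
            guard.check("big.event", Map.of("body", "x".repeat(100)));
        } catch (PayloadTooLargeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at the sender keeps the oversized payload off the transport layer entirely, instead of relying on the receiving node's breaker to reject it.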
We should do something about this; to be discussed
This may also be a consequence of the enormous max_result_window we set: https://github.com/enonic/xp/commit/00287ab169d2b930fa1b44a6a9c2e7c5fc484834
So, if the master node queries data (to do a dump, for instance), the data node fetches it and sends the full chunk to the master. If the master can't fit it into its heap, the circuit breaker trips to prevent an OutOfMemoryError.
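A back-of-the-envelope check of that theory, as a small sketch. The numbers are assumptions picked for illustration only (a 1 GiB breaker limit, 2 KiB per serialized hit, and a window of one million); they are not measured from our cluster or taken from the commit above. Only the 10,000 figure is the Elasticsearch default for max_result_window.

```java
/** Rough arithmetic: can one full result page fit under a breaker limit? */
public class ResultWindowEstimate {

    /** Worst-case bytes for one response: number of hits times average serialized hit size. */
    static long worstCaseResponseBytes(long hits, long avgHitBytes) {
        return Math.multiplyExact(hits, avgHitBytes);
    }

    static boolean fits(long responseBytes, long breakerLimitBytes) {
        return responseBytes <= breakerLimitBytes;
    }

    public static void main(String[] args) {
        long limit = 1L << 30;  // assumed breaker limit: 1 GiB
        long avgHit = 2048;     // assumed average hit size: 2 KiB

        long defaultWindow = worstCaseResponseBytes(10_000, avgHit);   // 20,480,000 bytes
        long hugeWindow = worstCaseResponseBytes(1_000_000, avgHit);   // 2,048,000,000 bytes

        System.out.println("default window fits: " + fits(defaultWindow, limit)); // true
        System.out.println("huge window fits: " + fits(hugeWindow, limit));       // false
    }
}
```

Under these assumptions a single request using the full enlarged window is enough to exceed the limit on its own, without any other traffic in flight.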
Apparently, this may also happen when Elasticsearch is overloaded: https://discuss.elastic.co/t/should-circuitbreakingexception-cause-the-node-to-become-failed/220817
@gbbirkisson @hjelmevold can you confirm this bug is still reproducible? If not, let's close this issue.
I have not seen this since HZ was introduced.
On the "fhi-xpqa" installation (v7.0.1), snapshots stopped working, and the following exception is given:
Research what this parent circuit breaker (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/circuit-breaker.html#parent-circuit-breaker) is doing, why it triggers, and what can be done about it.
Some notes: