Any memory issues in production?

spiritinlife commented 2 years ago

Hey,

Thank you for your work on this. I have been using this rescore plugin in production for a week now and it works great, but I have started seeing some random downtimes which I still haven't figured out.

It seems the parent circuit breaker trips, but the usage reported by the other breakers doesn't justify it. Since the rescore plugin is the only change I have made on the cluster, I am now wondering if there is a memory leak in the plugin.

The logs I get indicate that the error happens in response to a shard allocation procedure, but the memory usage of the parent breaker doesn't make sense.

RemoteTransportException: [XXXXX][XXXXXX][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [4106680966/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4106664944/3.8gb], new bytes reserved: [16022/15.6kb], usages [request=0/0b, fielddata=71168/69.5kb, in_flight_requests=16022/15.6kb, accounting=2395632/2.2mb]

This might be better addressed to the Elasticsearch repo, but I was wondering if you have witnessed any similar issues?

anti-social commented 2 years ago

We haven't upgraded Elasticsearch to 7.x yet. Our production clusters still run on 6.x and we haven't observed circuit breaker errors.

I don't think the plugin is responsible for this problem. In 7.x there was a significant circuit breaker update. Also there is a discussion.

As I understand it, the circuit breaker tries to detect whether there is enough free heap to perform a request, but it doesn't know how much of the heap can be freed by GC.
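
To make that concrete, here is a rough sketch in Python that only redoes the arithmetic from the exception message above. The numbers are copied from the log; the variable names are made up for illustration, and reading "real usage" as the heap occupancy seen by the JVM (including garbage not yet collected) is an assumption based on how the 7.x parent breaker appears to work, not something confirmed in this thread.

```python
# Numbers copied verbatim from the CircuitBreakingException in the log above.
real_usage = 4106664944       # "real usage" (assumed: current heap occupancy)
new_bytes_reserved = 16022    # "new bytes reserved" for the incoming request
limit = 4080218931            # parent breaker limit

# Memory tracked by the child breakers ("usages [...]" in the log).
child_usages = {
    "request": 0,
    "fielddata": 71168,
    "in_flight_requests": 16022,
    "accounting": 2395632,
}

# The parent breaker compares real usage plus the new reservation to the limit.
would_be = real_usage + new_bytes_reserved
tracked = sum(child_usages.values())

print("would be:", would_be)                  # matches the log's "would be" value
print("over the limit:", would_be > limit)
print("tracked by child breakers:", tracked)
print("not tracked by any child breaker:", real_usage - tracked)
```

The child breakers account for only about 2.4 MB, while the reported real usage is about 3.8 GB. That gap would be consistent with the parent breaker tripping on heap that is occupied but possibly collectable, rather than on memory held by any tracked structure.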

I think you should try tuning these JVM options:

-XX:G1ReservePercent
-XX:InitiatingHeapOccupancyPercent

spiritinlife commented 2 years ago

@anti-social Thank you for your help and insights. Will keep investigating :)

anti-social / elasticsearch-rescore-grouping-mixup
