elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Logstash not exiting properly on OOM #14992

Open henrikno opened 1 year ago

henrikno commented 1 year ago

Logstash information:

Please include the following information:

Logstash version: 8.5.2 (Docker)

OS version (uname -a if on a Unix-like system): Linux 5.4.0-1054-aws #57~18.04.1-Ubuntu SMP Thu Jul 15 03:21:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: Since https://github.com/elastic/logstash/issues/7987 was closed I expected an OOM to make logstash exit.

We got an OOM error in the Docker logs:

[11894.103s][warning][gc,alloc] [logging_proxy_requests]>worker20: Retried waiting for GCLocker too often allocating 976780 words
[11894.117s][warning][gc,alloc] [logging]>worker19: Retried waiting for GCLocker too often allocating 395131 words
[11894.130s][warning][gc,alloc] [logging_proxy_requests]>worker20: Retried waiting for GCLocker too often allocating 976778 words
[11894.143s][warning][gc,alloc] [logging_cluster]>worker23: Retried waiting for GCLocker too often allocating 370208 words
[12126.964s][warning][gc,alloc] [logging_cluster]>worker23: Retried waiting for GCLocker too often allocating 2851289 words
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid1.hprof ...
Unable to create java_pid1.hprof: File exists

But the process was still running and logging this:

{"level":"WARN","loggerName":"org.logstash.plugins.pipeline.PipelineBus","timeMillis":1680653294679,"thread":"[logging]>worker17","logEvent":{"message":"Attempted to send event to 'logging_proxy_requests' but that address was unavailable. Maybe the destination pipeline is down or stopping? Will Retry."}}

And it looked like clients were still connecting to it, but would stall when trying to send messages. The logs have rolled over, so I'm not sure whether there were other log messages in the Logstash logs.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) pipeline definition(s), settings, locale, etc. The easier you make it for us to reproduce it, the more likely it is that somebody will take the time to look at it.

  1. Start logstash
  2. Point beats to logstash
  3. Set batch_size too high (we used 33k with 32 workers across 8 pipelines; we've since dialed it down). A sketch of roughly equivalent settings follows this list.
  4. Take down ES so LS needs to queue and retry.
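
For illustration, a rough multi-pipeline configuration along those lines (a sketch, not our exact config: the pipeline ids come from the log excerpts above, the paths and the remaining pipelines are hypothetical):

# pipelines.yml -- sketch of the kind of settings described in step 3
- pipeline.id: logging
  pipeline.workers: 32
  pipeline.batch.size: 33000
  path.config: "/usr/share/logstash/pipeline/logging.conf"   # hypothetical path
- pipeline.id: logging_proxy_requests
  pipeline.workers: 32
  pipeline.batch.size: 33000
  path.config: "/usr/share/logstash/pipeline/logging_proxy_requests.conf"   # hypothetical path
# ...plus six more pipelines with the same worker/batch settings

With Elasticsearch down and the outputs retrying, each in-flight batch of 33k events per worker stays in memory, which is presumably what drives the heap exhaustion shown above.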

It looked like https://github.com/elastic/logstash/pull/12470 didn't add -XX:+ExitOnOutOfMemoryError but instead installed a setDefaultUncaughtExceptionHandler, which is what ES used to do. However, ES later also added -XX:+ExitOnOutOfMemoryError in https://github.com/elastic/elasticsearch/pull/71542, because there were cases where the error would not bubble up to the handler (something caught it along the way). That case sounds similar to this one: the node OOMed but didn't restart until 7 minutes later (https://github.com/elastic/elasticsearch/issues/71443). It's usually better to die early, since clients can then reconnect to a different node. Note that the node still reports status: green, so there is no way to see externally that it is unhealthy.
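
To make the difference concrete, here is a minimal, self-contained Java sketch (not Logstash's actual handler code) of why the handler approach is weaker: a default uncaught-exception handler only runs if the OutOfMemoryError reaches the top of a thread uncaught, whereas -XX:+ExitOnOutOfMemoryError terminates the JVM at the failing allocation regardless of intervening catch blocks.

import java.util.ArrayList;
import java.util.List;

public class OomExitSketch {
    public static void main(String[] args) {
        // Handler-based approach: halt the JVM when an OutOfMemoryError
        // reaches the top of a thread uncaught.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                System.err.println("Fatal OOM on " + thread.getName() + ", halting JVM");
                Runtime.getRuntime().halt(127);
            }
        });

        // The weakness: if any code on the stack catches Throwable, the handler
        // above never runs and the process keeps limping along, similar to the
        // behavior reported in this issue.
        try {
            List<byte[]> leak = new ArrayList<>();
            while (true) {
                leak.add(new byte[64 * 1024 * 1024]); // exhaust the heap
            }
        } catch (Throwable t) {
            System.err.println("OOM swallowed by a catch block: " + t);
        }
    }
}

Run with a small heap (e.g. java -Xmx64m OomExitSketch) and the handler never fires because the catch block swallows the error; add -XX:+ExitOnOutOfMemoryError and the JVM exits at the first failed allocation.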

ash-darin commented 7 months ago

This is still an issue. The required Java version is now 11/17 (https://github.com/elastic/logstash/blob/main/README.md#prerequisites), so the reason for reverting #7987 with #8254 is no longer valid.
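
In the meantime, a possible manual workaround (a sketch, not an official recommendation) is to opt into the flag yourself in config/jvm.options:

# config/jvm.options -- hard-exit on OOM until Logstash enables this by default
-XX:+ExitOnOutOfMemoryError
# alternative that also writes a JVM fatal error report on OOM:
# -XX:+CrashOnOutOfMemoryError

In the Docker image the same flag can be appended via the LS_JAVA_OPTS environment variable, so that the container exits and can be restarted instead of sitting half-alive while still reporting status: green.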