Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Graylog container stops working with java.lang.OutOfMemoryError: Java heap space #19806

Open pbrzica opened 3 months ago

pbrzica commented 3 months ago

Since upgrading from major version 5 to 6, we have noticed a new issue. Every couple of days or so, Graylog first starts logging the following:

docker-graylog-1  | 01:10:49.650 [processbufferprocessor-1] WARN  org.graylog2.streams.StreamRouterEngine - Error matching stream rule <646b7aa40063d45f4807fe8a>  <REGEX/^prod-logging> for stream Random Stream Name
docker-graylog-1  | java.util.concurrent.TimeoutException: null
docker-graylog-1  |     at java.base/java.util.concurrent.FutureTask.get(Unknown Source) ~[?:?]
docker-graylog-1  |     at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:153) ~[graylog.jar:?]
docker-graylog-1  |     at org.graylog2.streams.StreamRouterEngine$Rule.matchWithTimeOut(StreamRouterEngine.java:325) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.streams.StreamRouterEngine.match(StreamRouterEngine.java:206) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.streams.StreamRouter.route(StreamRouter.java:104) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.messageprocessors.StreamMatcherFilterProcessor.route(StreamMatcherFilterProcessor.java:66) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.messageprocessors.StreamMatcherFilterProcessor.process(StreamMatcherFilterProcessor.java:81) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:167) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:137) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:107) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:52) [graylog.jar:?]
docker-graylog-1  |     at org.graylog2.shared.buffers.PartitioningWorkHandler.onEvent(PartitioningWorkHandler.java:52) [graylog.jar:?]
docker-graylog-1  |     at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:167) [graylog.jar:?]
docker-graylog-1  |     at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122) [graylog.jar:?]
docker-graylog-1  |     at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
docker-graylog-1  |     at java.base/java.lang.Thread.run(Unknown Source) [?:?]
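
For context on where those TimeoutExceptions come from: the matchWithTimeOut and SimpleTimeLimiter.callWithTimeout frames show that each stream-rule regex match runs under Guava's SimpleTimeLimiter, so a match that does not finish within its budget surfaces as the WARN above. A minimal standalone sketch of that pattern follows; the regex, input string and 1-second budget are made-up illustration values, not Graylog's actual internals or configuration.

import com.google.common.util.concurrent.SimpleTimeLimiter;
import com.google.common.util.concurrent.TimeLimiter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class RegexTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Illustration only: a throwaway executor, a prefix regex and a
        // 1-second budget standing in for the per-rule match timeout.
        ExecutorService pool = Executors.newCachedThreadPool();
        TimeLimiter limiter = SimpleTimeLimiter.create(pool);
        Pattern rule = Pattern.compile("^prod-logging");
        String source = "prod-logging-host-01";
        try {
            boolean matched = limiter.callWithTimeout(
                    () -> rule.matcher(source).find(), 1, TimeUnit.SECONDS);
            System.out.println("matched = " + matched);
        } catch (TimeoutException e) {
            // This is the path behind the WARN lines above: the match did not
            // finish within the budget, e.g. because worker threads are starved.
            System.err.println("stream rule match timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}

The sketch only shows the mechanism that produces the warnings; it says nothing about why the matches suddenly start timing out.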

As time goes by, more and more of these logs appear, and then everything starts crashing with Java heap space errors. Example:

docker-graylog-1  | 02:16:55.439 [scheduled-daemon-23] ERROR org.graylog2.shared.bindings.SchedulerBindings - Thread scheduled-daemon-23 failed by not catching exception: java.lang.OutOfMemoryError: Java heap space.
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 30"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 25"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 27"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 20"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 2"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AMQP Connection 127.0.0.1:5672"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "stream-router-62"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 31"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 5"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 31"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 22"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "aws-instance-lookup-refresher-0"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 17"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 15"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 20"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-71-thread-1"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "I/O dispatcher 26"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "cluster-rtt-ClusterId{value='6680ee1f1b2f3946cfffaba1', description='null'}-127.0.0.1:27017"
docker-graylog-1  |
docker-graylog-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "inputbufferprocessor-1"
docker-graylog-1  | 02:52:12.221 [inputbufferprocessor-4] WARN  org.graylog2.shared.buffers.InputBufferImpl - Unable to process event RawMessageEvent{raw=null, uuid=cc5ef3b5-3818-11ef-9476-4a8499687792, encodedLength=1711}, sequence 458134355
docker-graylog-1  | java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
docker-graylog-1  | 02:52:12.221 [scheduled-daemon-0] ERROR org.graylog2.shared.bindings.SchedulerBindings - Thread scheduled-daemon-0 failed by not catching exception: java.lang.OutOfMemoryError: Java heap space.

We receive the most logs from our RabbitMQ input, averaging around 3-4k messages per second per node. I've tried increasing the heap and decreasing the number of processors (see the configuration sketch below), but nothing seems to help.
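
For reference, these are the kinds of settings involved, shown with purely illustrative values and assuming the official Docker image's convention of passing JVM options via GRAYLOG_SERVER_JAVA_OPTS and mapping GRAYLOG_-prefixed environment variables to graylog.conf options:

# docker-compose.yml excerpt -- example values, not a recommendation
environment:
  # JVM heap size passed to the Graylog server process
  GRAYLOG_SERVER_JAVA_OPTS: "-Xms4g -Xmx4g"
  # equivalent to processbuffer_processors / outputbuffer_processors
  # in graylog.conf
  GRAYLOG_PROCESSBUFFER_PROCESSORS: "4"
  GRAYLOG_OUTPUTBUFFER_PROCESSORS: "2"

As noted in the Possible Solution section below, raising the heap only seems to delay the crash rather than prevent it.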

I've also attached load and memory graphs (Graylog completely stops working at around 5:00, but prior to that, load and memory usage are completely normal). [Screenshot from 2024-07-02 11-01-32]

Expected Behavior

Graylog doesn't run out of heap space

Current Behavior

Graylog works fine for some time and then, at random intervals, starts crashing due to heap errors.

Possible Solution

I've noticed that increasing the heap increases the duration that Graylog stays healthy, so is it possible this is a memory leak somewhere?

Steps to Reproduce (for bugs)

  1. Run Graylog and ingest logs
  2. After some time, Graylog crashes with heap errors (this happens even if the heap is increased by more than 2x)

Context

Self-explanatory.

Your Environment

Using the official Graylog Docker image.

tellistone commented 3 months ago

Hello, thanks for raising this; it looks like a memory leak.

re: Error matching stream rule <646b7aa40063d45f4807fe8a> <REGEX/^prod-logging> for stream Random Stream Name

Is the associated stream receiving messages from the RabbitMQ input?

pbrzica commented 3 months ago

Hi, just checked the logs.

The associated stream is receiving logs from RabbitMQ, but we start getting these errors on all of our streams, including ones using GELF TCP inputs (the above was just an example). I am guessing that as memory gets lower it happens more and more, until the heap finally runs out. If it helps, almost all of our streams (76 of them, excluding the system ones) use regex in their stream rules, mainly 2 or 3 rules in the style of:

source: ^prod- or channel: ^service$

I can try setting up some more Graylog metrics if you think they'd be helpful (just let me know if you have any specific ones in mind).

I'll also try stopping/starting the inputs after some time to see if that can maybe help.

pbrzica commented 3 months ago

Just noticed #19629. I don't know whether anything in our setup is specifically affected by it, but we will update to 6.0.4 today and report back.

pbrzica commented 2 months ago

Reporting back: Graylog has been up for 11 days without any issues on version 6.0.4.

thll commented 2 months ago

Thanks for the feedback, @pbrzica. Very much appreciated!