Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Output buffer not being filled with new messages #4326

Closed: madiTG closed this issue 4 years ago

madiTG commented 6 years ago

Expected Behavior

I expect messages to be moved from the process buffer to the output buffer continuously.

Current Behavior

The process buffer fills up while the output buffer stays empty, and nothing is written to the database.

Possible Solution

Possibly a thread lock or something like that

Steps to Reproduce (for bugs)

I reproduce this by pushing a lot of messages into Graylog; it works for some time and then stops sending messages to the ES cluster. I have no idea how you could reproduce this otherwise.

Context

Pushing a high volume of messages (circa 10k/s) to Graylog.

Thread dump of the affected node (node 6e52bb4b / graylog2), taken at Wed Nov 08 2017 11:32:36 GMT+0100:

"outputbufferprocessor-0" id=32 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)
"outputbuffer-processor-executor-0" id=219 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
"outputbufferprocessor-1" id=33 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)

"outputbufferprocessor-2" id=34 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-0" id=220 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=221 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=222 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-0" id=223 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=224 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=225 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=226 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=227 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Your Environment

joschi commented 6 years ago

For reference: https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719

madiTG commented 6 years ago

This seems to be the same problem as in https://github.com/Graylog2/graylog2-server/issues/3553

Some thread data from JMX: graylog_jmx

gruselglatz commented 6 years ago

Seeing the same issue here. I lowered from 10 output processors to 5 and raised the batch size to 1,500, but no luck.

It processes for a few seconds, then dies again.

Graylog 2.3.2+3df951e on syslog.xyz.local (Oracle Corporation 1.8.0_144 on Linux 3.10.0-693.2.2.el7.x86_64)

bernd commented 6 years ago

We need the complete Graylog and Elasticsearch logs to be able to start debugging this.

davemcphee commented 6 years ago

Experiencing exactly the same issue here. Above a certain threshold I see exactly the same results as madiTG. The threshold varies depending on various Graylog config options related to output buffers, but in the end Graylog hangs and has to be restarted; there are no problems with ES and no errors in the log.

As the https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719 comments indicate, logs are empty.

(screenshot attached)

gruselglatz commented 6 years ago

@bernd it's a little late for logs now, but both logs were empty; otherwise I would have posted them along with the issue. As @davemcphee said, it simply stops processing messages and starts writing them to all buffers and the journal, but they are never processed.

bernd commented 6 years ago

@davemcphee @gruselglatz Okay, thanks for the update. After reading the topic in the community forum again, it looks like some message processing is blocking. Since the output buffers are basically empty but the process buffer is full, Graylog cannot process any more messages.

To have a chance to debug this further we need the following information (a collection sketch follows after the list):

  1. A list of installed plugins
    • curl localhost:9000/api/system/plugins
  2. All inputs + all extractors
    • curl localhost:9000/api/system/inputs (collect all input ids)
    • curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input id)
  3. All streams + streamrules
    • curl localhost:9000/api/streams
  4. All drools rules (if you use any)
  5. Message processors config
    • curl localhost:9000/api/system/messageprocessors/config
  6. All lookup tables, caches and adapters
    • curl localhost:9000/api/system/lookup/tables
    • curl localhost:9000/api/system/lookup/caches
    • curl localhost:9000/api/system/lookup/adapters
  7. All pipelines, pipeline connections and pipeline rules
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections
  8. All grok patterns
    • curl localhost:9000/api/system/grok

Feel free to send me this information as a private message in the community forum if you don't want to share all of this in public.
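
For anyone gathering this for a report, a minimal collection sketch is below. It assumes Python 3 with the requests package, a node on localhost:9000, and admin credentials or an access token; the endpoint paths are the ones listed above, and the field names in the inputs response may differ by Graylog version.

    import requests

    BASE = "http://localhost:9000/api"    # adjust host/port to your node
    AUTH = ("admin", "password")          # admin user or access token

    # Endpoints taken from the list above (everything except per-input extractors).
    ENDPOINTS = [
        "system/plugins",
        "system/inputs",
        "streams",
        "system/messageprocessors/config",
        "system/lookup/tables",
        "system/lookup/caches",
        "system/lookup/adapters",
        "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline",
        "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule",
        "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections",
        "system/grok",
    ]

    for path in ENDPOINTS:
        resp = requests.get(f"{BASE}/{path}", auth=AUTH, timeout=10)
        with open(path.replace("/", "_") + ".json", "w") as fh:
            fh.write(resp.text)           # save each response for the report

    # Extractors are fetched per input id (field names may differ by version).
    inputs = requests.get(f"{BASE}/system/inputs", auth=AUTH, timeout=10).json()
    for inp in inputs.get("inputs", []):
        resp = requests.get(f"{BASE}/system/inputs/{inp['id']}/extractors", auth=AUTH, timeout=10)
        with open(f"extractors_{inp['id']}.json", "w") as fh:
            fh.write(resp.text)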

crazydavidoff commented 6 years ago

Hi all. This problem has been tormenting me for a month. I had to add new nodes to reduce the load. I'm using version 2.2.3. I hope this bug will be fixed soon. Let me know if you need more information about my instance.

davemcphee commented 6 years ago

Moin @bernd ,

The GL community doesn't trust me enough to PM you random links yet, so here you are: https://www.dropbox.com/s/pu5r9yfz8x0oyyt/graylog_debug.tar.gz?dl=0

Very little of interest there I'm afraid. It's a UAT system, so there's only one "real" input: the Syslog UDP listener, which listens for nginx access logs. These logs are in JSON format but wrapped in an rsyslog header, so I use a regex to cut out the JSON string and then key/value extract that JSON. The UAT system can generate 20k msg/s and bursts up to 40k, but we can (and of course do) throttle this.

I have elasticsearch and graylog prometheus exporters running, and would be happy to test various config changes, tweak message throttling, and show results.
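
For readers unfamiliar with the extraction approach described above, here is a rough standalone sketch of the same two-step idea in Python; the sample message and rsyslog header format are made up for illustration, not taken from this setup.

    import json
    import re

    # Hypothetical rsyslog-wrapped nginx access log line with a JSON payload.
    raw = '<190>Nov  8 11:32:36 web01 nginx: {"status": 200, "request_time": 0.012, "uri": "/health"}'

    # Step 1: use a regex to cut the JSON string out of the syslog wrapper.
    match = re.search(r'(\{.*\})\s*$', raw)
    if match:
        # Step 2: key/value extract the JSON into individual fields.
        for key, value in json.loads(match.group(1)).items():
            print(key, value)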

bernd commented 6 years ago

@davemcphee Thank you for the information! As you said, there isn't that much in there, unfortunately.

Can you send me a complete thread dump of the Graylog JVM process when this happens again? If processing is blocked, wait a few minutes and then take a thread dump. Also, please do a curl localhost:9000/api/system/metrics after you take the dump and include it as well.

How is the CPU load when the processing is blocked? If you have some graphs for that as well, it would be much appreciated.

Could you show us an example raw message with a typical JSON payload?

Thank you!

madiTG commented 6 years ago

@bernd please check the community forum (referenced here) for the CPU charts. There are some thread dumps out there too.

jebucha commented 6 years ago

You may have worked through your situation already, but what I found when I encountered a similar scenario earlier this year is that under certain message conditions and ingest loads my regex extractors were taking an exorbitantly long time to run (relatively speaking). It took a while to narrow that down, but once I was on that track I was able to use the metrics for individual extractors to gauge my success in tuning. I went through each of my extractors and tuned the regex, dropping execution times from around 50,000μs to 50μs. I had to move away from "any character" matches to specific matches, and also be much more intentional with my extractor condition matches. At worst, last spring, my Graylog nodes were backing up by > 1 million events each, several times per week. Today my ingest/search rates are higher than back then, but I almost never have backups. The only time I have backups now (small ones at that) is when a given system or device goes haywire and floods with 10x the normal message rate for a few minutes.

madiTG commented 6 years ago

I only use JSON extractors, and I tried with pipelines turned off: same effect. I already went through regex optimization. Execution times trend to 0 because nothing is being processed.

bernd commented 6 years ago

@madiTG I read the topic in the community forum and saw your graphs. The thread dumps were not complete and contained only the part for the outputbuffer threads. Also, I guess your setup is not the same as @davemcphee's so his CPU graphs might be different. :wink:

davemcphee commented 6 years ago

I had never used jstack before so after capturing one or two thread dumps I began playing with some config settings and seeing if I could observe a difference in the jstack output, and ummmm ... I think I've fixed my issue :grimacing: Isn't that just typical ....

I can't confirm anything and need to test further (and probably not before the end of the year), but I believe it had something to do with the size of my ring_size and inputbuffer_ring_size; I reduced them massively.

I'm currently running with ring_size=8192 and inputbuffer_ring_size=32768, and 12 output processors, batch size 12k, and am able to process 30K msg/s, bursting to over 40K, on a single machine.

These numbers did require extensive tuning of the heap of both ES and Graylog (less is more), along with the number of shards per ES index (more is more). I'll keep an eye on it and report back if I have anything to add.
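
For reference, the settings quoted above map to the following keys in Graylog's server.conf. This is only a sketch of one working combination reported here, not a general recommendation; the right values depend heavily on hardware and message size.

    # graylog server.conf -- values reported above, for illustration only
    ring_size = 8192
    inputbuffer_ring_size = 32768
    outputbuffer_processors = 12
    output_batch_size = 12000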

madiTG commented 6 years ago

@davemcphee

hey dave, got any update on that?

berndleinfelder commented 6 years ago

maybe related to https://github.com/Graylog2/graylog2-server/issues/4614 ?

madiTG commented 6 years ago

Still happening:

@berndleinfelder

I tried the cancel workaround mentioned there; nothing happened.

thread dump here:

jstack_dump.txt

org.graylog2.buffers.process.size Gauge

Value: 8,192

org.graylog2.buffers.process.usage Gauge

Value: 8,192

org.graylog2.shared.buffers.ProcessBuffer.decodeTime Timer

95th percentile: 33μs 98th percentile: 47μs 99th percentile: 55μs Standard deviation: 7μs Mean: 21μs Minimum: 15μs Maximum: 97μs

org.graylog2.shared.buffers.ProcessBuffer.incomingMessages Meter

Total: 189,234,468 events Mean: 1,036.23 events/second 1 minute avg: 0 events/second 5 minute avg: 62.37 events/second 15 minute avg: 634.78 events/second

org.graylog2.shared.buffers.ProcessBuffer.parseTime Timer

95th percentile: 11μs 98th percentile: 23μs 99th percentile: 34μs Standard deviation: 4μs Mean: 7μs Minimum: 3μs Maximum: 56μs

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.created Meter

Total: 4 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.running Counter

Value: 4

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.terminated Meter

Total: 0 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.filteredOutMessages Meter

Total: 0 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages Meter

Total: 189,234,468 events Mean: 1,036.25 events/second 1 minute avg: 0 events/second 5 minute avg: 62.84 events/second 15 minute avg: 636.41 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.outgoingMessages Meter

Total: 189,234,467 events Mean: 1,036.25 events/second 1 minute avg: 0 events/second 5 minute avg: 62.84 events/second 15 minute avg: 636.41 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime Timer

95th percentile: 730μs 98th percentile: 1,298μs 99th percentile: 1,993μs Standard deviation: 821μs Mean: 447μs Minimum: 225μs Maximum: 17,226μs

I changed my config values to those provided by Dave.

(screenshots attached)

Restarting the node solved the issue.

drAlberT commented 6 years ago

Anybody found a solution?

In my logs I only have tons of:

2018-05-23T10:25:47.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:25:47.860Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #30).
2018-05-23T10:26:06.049Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:26:10.231Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:18.574Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:19.230Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #31).
2018-05-23T10:26:37.210Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:41.180Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:49.858Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:50.438Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #32).
2018-05-23T10:27:08.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:27:12.147Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.060Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.591Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #33).
2018-05-23T10:27:39.468Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:43.102Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.301Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.737Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #34).
2018-05-23T10:28:10.706Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:28:14.151Z INFO  [Messages] Bulk indexing finally successful (attempt #75).
joschi commented 6 years ago

@drAlberT Your Elasticsearch cluster is down or unreachable.

Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.

Thank you!

drAlberT commented 6 years ago

Anyway, I'm not so sure:

root@graylog-a:~# curl -L https://vpc-graylog-22XXXXXXXXXXXjsuie4.eu-west-1.es.amazonaws.com
{
  "name" : "jS4Ja1Q",
  "cluster_name" : "903437758265:graylog",
  "cluster_uuid" : "nWkEX1wmRcaCuh5MDePpTA",
  "version" : {
    "number" : "5.5.2",
    "build_hash" : "Unknown",
    "build_date" : "2017-10-18T04:35:01.381Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}

So the ES cluster is alive and the Graylog node can reach it; I double-checked the GL conf and the cluster name is right. After a few changes I got the following trace. It seems to be saying I need to increase the Java heap of the GL process, right?

2018-05-24T16:58:33.364Z ERROR [Messages] Caught exception during bulk indexing: java.lang.OutOfMemoryError: Java heap space, retrying (attempt #1).
2018-05-24T16:58:33.386Z ERROR [Messages] Couldn't bulk index 500 messages.
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) [graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) [graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) [graylog.jar:?]
        at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
        at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
        at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
        ... 14 more
2018-05-24T16:58:33.394Z ERROR [BlockingBatchedESOutput] Unable to flush message buffer
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:176) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) ~[graylog.jar:?]
        at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) ~[graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
        at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
        ... 13 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
        at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
        ... 13 more
2018-05-24T16:58:34.350Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).

Increasing it to 3 GB removed the heap error, but not the original issue:

2018-05-24T17:05:13.607Z INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2018-05-24T17:05:13.632Z INFO  [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now STARTING
2018-05-24T17:05:13.639Z INFO  [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now STARTING
2018-05-24T17:05:13.642Z INFO  [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now STARTING
2018-05-24T17:05:13.678Z INFO  [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now RUNNING
2018-05-24T17:05:13.681Z INFO  [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now RUNNING
2018-05-24T17:05:13.709Z INFO  [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now RUNNING
2018-05-24T17:05:23.928Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).
2018-05-24T17:05:26.594Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #2).
2018-05-24T17:05:28.522Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #3).
2018-05-24T17:05:30.235Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #4).
2018-05-24T17:05:31.623Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #5).
2018-05-24T17:05:33.120Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #6).
2018-05-24T17:05:34.664Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #7).
Geramy commented 6 years ago

Any update on this? We are getting about 6 messages a second but nothing is being processed; there are no errors in the logs and everything is healthy, the same context as this issue. This seems to be a bug, as our configuration does not differ much from the defaults.

jalogisch commented 6 years ago

@Geramy you should move over to the community. Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.

  1. You did not add any information that might help with debugging.
  2. You did not include which version of Graylog you are having trouble with.
  3. You did not properly describe how this happens.

mahnunchik commented 5 years ago

Same issue here. ES is green, Mongo shows no errors, and all messages are stuck in the journal.

Graylog 2.4, Mongo 3.6, ES 5.6.12

mahnunchik commented 5 years ago

I've solved the problem by removing the Graylog journal data. Messages started being indexed immediately.

bernd commented 5 years ago

I am closing this issue now. We didn't manage to find the root cause, unfortunately, but it also doesn't make sense to keep it open. Please visit the discussion forum for help if you run into this again, or create a new issue with updated information (logs, metrics, etc.). Also, please make sure to run the latest stable Graylog release.

Thank you!

EpiqSty commented 5 years ago

We just faced the same issue today.

A Graylog node stopped processing messages. Meanwhile there were no issues with ES and no issues with disk space, memory, or CPU load on the affected node. It just got stuck: the process buffer was full, but the output buffer was empty.

So I disabled the affected node at the LB to prevent the message queue from growing, and started googling.

I arrived here from https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719/10

In the end I found this bug closed without any resolution, but with the advice to go back to the community forum.

So it looks like nobody is interested in a solution, which is sad.

One good thing in our case: after 40-60 minutes in the stuck state, the node started processing messages again.

mzhaase commented 5 years ago

This happens multiple times a day for us. The only thing that helps is deleting the journal. We are losing a couple million messages a day to this, not to mention the countless hours of dev and ops time wasted on it.

This renders our Graylog cluster all but unusable. If no solution is found, we will probably have to move our org away from Graylog.

bernd commented 5 years ago

@EpiqSty @mzhaase I will re-open this. We need the following information to have a chance to debug this (a per-node fetch sketch follows after the list):

  1. Graylog version
  2. Elasticsearch version
  3. Full Graylog server logs (from the time when the issue appears in your setup)
  4. Full Elasticsearch server logs (from the time when the issue appears in your setup)
  5. Cluster overview, JVM info and threaddumps (from the time when the issue appears in your setup)
    • curl localhost:9000/api/cluster
    • curl localhost:9000/api/cluster/<node-id>/jvm (for every node in the cluster)
    • curl localhost:9000/api/cluster/<node-id>/threaddump (for every node in the cluster)
  6. Graylog server metrics output
    • curl localhost:9000/api/system/metrics (for every node in the cluster)
      1. When the Graylog servers work as expected
      2. When the Graylog servers run into the issue
  7. A list of installed plugins
    • curl localhost:9000/api/system/plugins
  8. All inputs + all extractors
    • curl localhost:9000/api/system/inputs (collect all input ids)
    • curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input id)
  9. All streams + streamrules
    • curl localhost:9000/api/streams
  10. All drools rules (if you use any - only pre-3.0 setups)
  11. Message processors config
    • curl localhost:9000/api/system/messageprocessors/config
  12. All lookup tables, caches and adapters
    • curl localhost:9000/api/system/lookup/tables
    • curl localhost:9000/api/system/lookup/caches
    • curl localhost:9000/api/system/lookup/adapters
  13. All pipelines, pipeline connections and pipeline rules
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule
    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections
  14. All grok patterns
    • curl localhost:9000/api/system/grok

Thank you!
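
A small per-node fetch sketch for items 5 and 6 above, assuming Python 3 with requests; the assumption that GET /api/cluster returns a JSON object keyed by node id should be verified against your Graylog version.

    import requests

    BASE = "http://localhost:9000/api"
    AUTH = ("admin", "password")          # admin user or access token

    # Assumption: the cluster overview is a JSON object keyed by node id.
    nodes = requests.get(f"{BASE}/cluster", auth=AUTH, timeout=10).json()

    for node_id in nodes:
        for what in ("jvm", "threaddump"):
            resp = requests.get(f"{BASE}/cluster/{node_id}/{what}", auth=AUTH, timeout=30)
            with open(f"{node_id}_{what}.json", "w") as fh:
                fh.write(resp.text)       # one file per node and endpoint

    # Metrics snapshot for this node (repeat per node, both healthy and stuck).
    metrics = requests.get(f"{BASE}/system/metrics", auth=AUTH, timeout=30)
    with open("system_metrics.json", "w") as fh:
        fh.write(metrics.text)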

mzhaase commented 5 years ago

Graylog 2.4.7

ES 5.6.14

I collected all the other information; however, there is a lot of business logic and personally identifiable information in the several hundred thousand lines. Can I send it somewhere via email instead of posting it here?

bernd commented 5 years ago

@mzhaase Feel free to send it to my-github-name at graylog.com.

mzhaase commented 5 years ago

Sent.

no-response[bot] commented 5 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

sturman commented 5 years ago

Hi @bernd. Is there any update on the investigation? I have the same issue with Graylog 2.4.6+ceaa7e4 on graylog (Oracle Corporation 1.8.0_172 on Linux 4.4.0-1044-aws).

bernd commented 4 years ago

@sturman Nope, unfortunately not. I wasn't able to find anything suspicious in the provided information. :confused: This issue also fell through the cracks on my side because it was closed by our "no-response" bot for some reason. Sorry about that, @mzhaase :disappointed:

My only guess is that some broken message is stuck at the front of the journal, and that's why message processing doesn't continue (two people reported that deleting the journal helps). But without anything related to that in the logs, I am unable to follow up on this to find the cause, unfortunately.

on3iropolos commented 4 years ago

I think @bernd is right that something is getting stuck in the journal and/or some extractor rule is overly intensive for a specific log message. I have observed this same problem on our Graylog cluster since adding the first GELF input. I have been able to get things rolling again with the following commands:

rm -rf /var/lib/graylog-server/journal/.lock   # remove the stale journal lock file
systemctl restart graylog-server               # restart the node
systemctl status graylog-server                # confirm the service came back up

Then it begins processing the journal logs and works for some time, and eventually hangs up again randomly. I am still researching the cause of this.

bernd commented 4 years ago

@on3iropolos @sturman @mzhaase @EpiqSty @mahnunchik @madiTG

We implemented a fix to handle problems with bulk requests that are too big. (This is the "broken pipe" error.) Previously, if Graylog tried a bulk index request against Elasticsearch that was too big (the http.max_content_length setting, default 100 MB), we would retry the request over and over again, which results in processing being stopped. This could happen if the batch size is high and the batch contains lots of big messages, or a few really big ones. Also, the http.max_content_length setting is lower in some cloud environments like AWS.

This could be an explanation for the issues you folks saw.

The fix has been implemented in https://github.com/Graylog2/graylog2-server/pull/7071 and will be part of the upcoming 3.2.0 feature and 3.1.4 bug fix releases.
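
If upgrading is not immediately possible, two related knobs can be checked against the failure mode described above; the values below are illustrative only, not recommendations from the thread.

    # elasticsearch.yml -- the bulk request body limit bernd mentions (default 100mb);
    # managed services such as AWS may cap this lower and not let you raise it.
    http.max_content_length: 100mb

    # graylog server.conf -- keeping batches smaller reduces the chance that a
    # single bulk request exceeds that limit.
    output_batch_size = 500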

on3iropolos commented 4 years ago

@bernd - That's great news! I will implement 3.1.4 as soon as I am able and report back if the issue still occurs. Thanks!

bernd commented 4 years ago

@on3iropolos Awesome, thank you! :smiley:

on3iropolos commented 4 years ago

@bernd - I have installed the 3.1.4 update. I will update if I experience an issue like the one before.

madiTG commented 4 years ago

@bernd GJ

ilyesAj commented 3 years ago

There has been no clear resolution of this problem, and I'm facing the same issue here: the process buffers are filling up and I have no way to debug this. Any suggestions?

mzhaase commented 3 years ago

I suspect that certain messages create an endless loop in the process buffer processor, thus knocking that processor out. Once this happens to all processors, no more messages get processed.

My "solution" to this is to have a cronjob running a python script ever 5m that checks the journal utilization and restarts the server if it is above x. That works pretty well, actually.

ilyesAj commented 3 years ago

I also suspect that, but my main question is how to identify the root cause: which message is causing the infinite loop? Is there some strategy in Graylog to ignore such messages? None of this is clear.

diegoazevedo commented 3 years ago

Same problem here: ES is green but GL is not writing at all.

Restarting the server temporarily resolves the problem.

lijial commented 3 years ago

@bernd Dear bernd, I'm facing the same problem: 3,000+ msg/s in, 0 msg/s out (screenshot attached).

Graylog version: 3.3 (one master, two slaves); ES version: 6.8 (one master, two slaves); Mongo version: 3.6 (three nodes). There are no errors in the Graylog/ES/Mongo logs, the ES indices are all green, and the shards are all STARTED. When I restart ES/Mongo, or delete all journal messages in data/journal/messagejournal-0/* and restart Graylog, output is still 0 msg/s. I am stuck and looking forward to your help as soon as possible.