This seems to be the same problem as in https://github.com/Graylog2/graylog2-server/issues/3553
Some thread data from JMX:
Seeing the same issue here; lowered from 10 output processors to 5 and raised the batch size to 1500, but no luck.
It's processing for a few seconds, then dying again.
Graylog 2.3.2+3df951e on syslog.xyz.local (Oracle Corporation 1.8.0_144 on Linux 3.10.0-693.2.2.el7.x86_64)
We need the complete Graylog and Elasticsearch logs to be able to start debugging this.
Experiencing exactly the same issue here. Above a certain threshold, seeing exact same results as madiTG. Threshold varies depending on various graylog config options related to output buffers, but in the end, graylog hangs and has to be restarted, no problems with ES, no errors in log.
As the https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719 comments indicate, logs are empty.
@bernd it's a little bit late for logs, but both logs were empty; otherwise I would have posted them along with the issue. As @davemcphee said, it simply stops processing messages and starts writing them to all buffers and the journal, but they are never processed.
@davemcphee @gruselglatz Okay, thanks for the update. After reading the topic in the community forum again, it looks like some message processing is blocking. Since the output buffers are basically empty but the process buffer is full, Graylog cannot process any more messages.
To have a chance to debug this further we need the following information:
curl localhost:9000/api/system/plugins
curl localhost:9000/api/system/inputs (collect all input IDs)
curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input ID)
curl localhost:9000/api/streams
curl localhost:9000/api/system/messageprocessors/config
curl localhost:9000/api/system/lookup/tables
curl localhost:9000/api/system/lookup/caches
curl localhost:9000/api/system/lookup/adapters
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections
curl localhost:9000/api/system/grok
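For convenience, here is a minimal sketch of a script that dumps these endpoints into one directory. It assumes the API on localhost:9000 and admin:password for basic auth, so adjust both to your setup:

#!/usr/bin/env python3
"""Sketch: dump the Graylog debug endpoints listed above into ./graylog-debug/."""
import base64
import json
import os
import urllib.request

API = "http://localhost:9000/api"                      # assumption: local node
AUTH = base64.b64encode(b"admin:password").decode()    # assumption: basic auth credentials

ENDPOINTS = [
    "system/plugins",
    "system/inputs",
    "streams",
    "system/messageprocessors/config",
    "system/lookup/tables",
    "system/lookup/caches",
    "system/lookup/adapters",
    "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline",
    "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule",
    "plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections",
    "system/grok",
]

def get(path):
    req = urllib.request.Request(f"{API}/{path}", headers={
        "Authorization": f"Basic {AUTH}",
        "Accept": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

def save(name, data):
    with open(os.path.join("graylog-debug", name), "w") as f:
        json.dump(data, f, indent=2)

os.makedirs("graylog-debug", exist_ok=True)
for path in ENDPOINTS:
    save(path.replace("/", "_") + ".json", get(path))

# One extractor dump per input, as requested above; /system/inputs returns
# an object with an "inputs" list whose entries carry an "id".
for inp in get("system/inputs").get("inputs", []):
    input_id = inp["id"]
    save(f"extractors_{input_id}.json", get(f"system/inputs/{input_id}/extractors"))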
Feel free to send me this information as a private message in the community forum if you don't want to share all of this in public.
Hi all. This problem has been tormenting me for a month, and I had to add new nodes to reduce the load. I'm using version 2.2.3. I hope this bug will be fixed soon. Tell me if you need more information about my instance.
Hi @bernd,
The GL community doesn't trust me enough to PM you random links yet, so here you are: https://www.dropbox.com/s/pu5r9yfz8x0oyyt/graylog_debug.tar.gz?dl=0
Very little of interest there, I'm afraid. It's a UAT system, so there is only one "real" input: the Syslog UDP listener, which receives nginx access logs. These logs are in JSON format but wrapped in an rsyslog header, so I use a regex to cut out the JSON string and then key/value-extract that JSON. The UAT system can generate 20k msg/s with bursts up to 40k, but we can (and of course do) throttle this.
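For context, the extraction described above amounts to something like the following (a sketch with a made-up message; the real work is done by a Graylog regex extractor and a JSON/key-value extractor, not Python):

import json
import re

# Hypothetical example: an nginx access log rendered as JSON, wrapped in an
# rsyslog header. The regex cuts out the JSON object; json.loads() then
# yields the key/value pairs.
raw = ('<190>Nov  8 11:32:36 web01 nginx: '
       '{"remote_addr": "10.0.0.1", "status": "200", "request_time": "0.012"}')

match = re.search(r'(\{.*\})\s*$', raw)  # grab the JSON payload at the end of the line
if match:
    fields = json.loads(match.group(1))
    print(fields["status"], fields["request_time"])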
I have elasticsearch and graylog prometheus exporters running, and would be happy to test various config changes, tweak message throttling, and show results.
@davemcphee Thank you for the information! As you said, there isn't that much in there, unfortunately.
Can you send me a complete thread dump of the Graylog JVM process when this happens again? If processing is blocked, wait a few minutes and then take the thread dump. Also, please run curl localhost:9000/api/system/metrics after you have taken the dump and include the output as well.
How is the CPU load when the processing is blocked? If you have some graphs for that as well, it would be much appreciated.
Could you show us an example raw message with a typical JSON payload?
Thank you!
@bernd please check the community forum (referenced here) for the CPU charts. There are some thread dumps out there too.
You may have worked through your situation already, but what I found when I encountered a similar scenario earlier this year is that under certain message conditions and ingest loads my regex extractors were taking an exorbitantly long time to run (relatively speaking). It took a while to narrow that down, but once I was on that track I was able to use the metrics for individual extractors to gauge my success in tuning. I went through each of my extractors and tuned the regex, dropping execution times from roughly 50,000μs to 50μs. I had to move away from "any character" matches to specific matches, and also be much more intentional with my extractor condition matches. At worst, last spring my Graylog nodes were backing up by more than 1 million events each, several times per week. Today my ingest/search rates are higher than back then, but I almost never have backups. The only time I have backups now (small ones at that) is when a given system or device goes haywire and floods with 10x the normal message rate for a few minutes.
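To illustrate the kind of change described above (an illustration, not taken from the thread): on long lines a greedy "any character" match scans to the end of the string and backtracks, while a specific character class stops at the first delimiter, and that per-message difference adds up at high ingest rates.

import re
import timeit

# A long JSON-ish line with a large trailing field.
line = '{"status": "200", "body": "' + 'x' * 5000 + '"}'

greedy = re.compile(r'"status":\s*"(.*)",')      # .* runs to the end of the line, then backtracks
specific = re.compile(r'"status":\s*"([^"]*)"')  # [^"]* stops at the closing quote

# Both capture "200"; the greedy variant does far more scanning per call.
print(timeit.timeit(lambda: greedy.search(line), number=10000))
print(timeit.timeit(lambda: specific.search(line), number=10000))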
I only use JSON extractors, and I tried with pipelines turned off; same effect. I have already gone through regex optimisation. Execution times trend to 0 because nothing is being processed.
@madiTG I read the topic in the community forum and saw your graphs. The thread dumps were not complete and contained only the part for the outputbuffer threads. Also, I guess your setup is not the same as @davemcphee's so his CPU graphs might be different. :wink:
I had never used jstack before so after capturing one or two thread dumps I began playing with some config settings and seeing if I could observe a difference in the jstack output, and ummmm ... I think I've fixed my issue :grimacing: Isn't that just typical ....
I can't confirm anything and need to test further (and probably not before the end of the year), but I believe it had something to do with the size of my ring_size and inputbuffer_ring_size settings; I reduced them massively.
I'm currently running with ring_size=8192 and inputbuffer_ring_size=32768, 12 output processors, and a batch size of 12k, and am able to process 30K msg/s, bursting to over 40K, on a single machine.
These numbers did require tuning heap of both ES and Graylog extensively (less is more), along with number of shards per ES index (more is more). I'll keep an eye on it and report back if I have anything to add.
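For readers wanting to try the same thing, those knobs live in Graylog's server.conf; a sketch based on dave's numbers (purely illustrative, tune to your own hardware and heap):

# graylog server.conf (excerpt); values copied from the comment above, purely illustrative
ring_size = 8192
inputbuffer_ring_size = 32768
outputbuffer_processors = 12
output_batch_size = 12000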
@davemcphee
hey dave, got any update on that?
maybe related to https://github.com/Graylog2/graylog2-server/issues/4614 ?
Still happening:
@berndleinfelder
Did the cancel thing, nothing happened
thread dump here:
org.graylog2.buffers.process.size Gauge
Value: 8,192
org.graylog2.buffers.process.usage Gauge
Value: 8,192
org.graylog2.shared.buffers.ProcessBuffer.decodeTime Timer
95th percentile: 33μs 98th percentile: 47μs 99th percentile: 55μs Standard deviation: 7μs Mean: 21μs Minimum: 15μs Maximum: 97μs
org.graylog2.shared.buffers.ProcessBuffer.incomingMessages Meter
Total: 189,234,468 events Mean: 1,036.23 events/second 1 minute avg: 0 events/second 5 minute avg: 62.37 events/second 15 minute avg: 634.78 events/second
org.graylog2.shared.buffers.ProcessBuffer.parseTime Timer
95th percentile: 11μs 98th percentile: 23μs 99th percentile: 34μs Standard deviation: 4μs Mean: 7μs Minimum: 3μs Maximum: 56μs
org.graylog2.shared.buffers.ProcessBuffer.thread-factory.created Meter
Total: 4 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second
org.graylog2.shared.buffers.ProcessBuffer.thread-factory.running Counter
Value: 4
org.graylog2.shared.buffers.ProcessBuffer.thread-factory.terminated Meter
Total: 0 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second
org.graylog2.shared.buffers.processors.ProcessBufferProcessor.filteredOutMessages Meter
Total: 0 events Mean: 0 events/second 1 minute avg: 0 events/second 5 minute avg: 0 events/second 15 minute avg: 0 events/second
org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages Meter
Total: 189,234,468 events Mean: 1,036.25 events/second 1 minute avg: 0 events/second 5 minute avg: 62.84 events/second 15 minute avg: 636.41 events/second
org.graylog2.shared.buffers.processors.ProcessBufferProcessor.outgoingMessages Meter
Total: 189,234,467 events Mean: 1,036.25 events/second 1 minute avg: 0 events/second 5 minute avg: 62.84 events/second 15 minute avg: 636.41 events/second
org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime Timer
95th percentile: 730μs 98th percentile: 1,298μs 99th percentile: 1,993μs Standard deviation: 821μs Mean: 447μs Minimum: 225μs Maximum: 17,226μs
Changed my config values to those provided by dave
Restarting the node solved the issue.
Anybody found a solution?
In my logs I only have tons of:
2018-05-23T10:25:47.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:25:47.860Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #30).
2018-05-23T10:26:06.049Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:26:10.231Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:18.574Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:19.230Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #31).
2018-05-23T10:26:37.210Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:41.180Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:49.858Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:50.438Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #32).
2018-05-23T10:27:08.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:27:12.147Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.060Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.591Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #33).
2018-05-23T10:27:39.468Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:43.102Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.301Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.737Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #34).
2018-05-23T10:28:10.706Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:28:14.151Z INFO [Messages] Bulk indexing finally successful (attempt #75).
@drAlberT Your Elasticsearch cluster is down or unreachable.
Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.
Thank you!
Anyway, I'm not so sure:
root@graylog-a:~# curl -L https://vpc-graylog-22XXXXXXXXXXXjsuie4.eu-west-1.es.amazonaws.com
{
  "name" : "jS4Ja1Q",
  "cluster_name" : "903437758265:graylog",
  "cluster_uuid" : "nWkEX1wmRcaCuh5MDePpTA",
  "version" : {
    "number" : "5.5.2",
    "build_hash" : "Unknown",
    "build_date" : "2017-10-18T04:35:01.381Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}
So the ES cluster is alive, the Graylog node can reach it, and I double-checked the GL conf: the cluster name is right. After a few more changes I got the following trace; it seems to be saying I need to increase the Java heap of the GL process, right?
2018-05-24T16:58:33.364Z ERROR [Messages] Caught exception during bulk indexing: java.lang.OutOfMemoryError: Java heap space, retrying (attempt #1).
2018-05-24T16:58:33.386Z ERROR [Messages] Couldn't bulk index 500 messages.
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) [graylog.jar:?]
at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) [graylog.jar:?]
at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) [graylog.jar:?]
at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
... 14 more
2018-05-24T16:58:33.394Z ERROR [BlockingBatchedESOutput] Unable to flush message buffer
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:176) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) ~[graylog.jar:?]
at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) ~[graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
... 13 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
... 13 more
2018-05-24T16:58:34.350Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).
Increasing it to 3G removed the heap error, but not the original issue:
2018-05-24T17:05:13.607Z INFO [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2018-05-24T17:05:13.632Z INFO [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now STARTING
2018-05-24T17:05:13.639Z INFO [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now STARTING
2018-05-24T17:05:13.642Z INFO [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now STARTING
2018-05-24T17:05:13.678Z INFO [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now RUNNING
2018-05-24T17:05:13.681Z INFO [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now RUNNING
2018-05-24T17:05:13.709Z INFO [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now RUNNING
2018-05-24T17:05:23.928Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).
2018-05-24T17:05:26.594Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #2).
2018-05-24T17:05:28.522Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #3).
2018-05-24T17:05:30.235Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #4).
2018-05-24T17:05:31.623Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #5).
2018-05-24T17:05:33.120Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #6).
2018-05-24T17:05:34.664Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #7).
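For reference, on DEB/RPM package installs the Graylog heap is usually raised via GRAYLOG_SERVER_JAVA_OPTS rather than in server.conf; a sketch matching the 3G value mentioned above (the file location depends on the distribution, and the other JVM flags your package ships should be kept):

# /etc/default/graylog-server (Debian/Ubuntu) or /etc/sysconfig/graylog-server (RHEL/CentOS)
# Only the -Xms/-Xmx values change here; keep the GC and other flags already present,
# then restart graylog-server.
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g"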
Any update on this? We are getting about 6 messages a second, but nothing is being processed; there are no errors in the logs and everything is healthy, same context as this issue. This seems to be a bug, as our configuration does not differ much from the defaults.
@Geramy you should move over to the community. Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.
Same issue here. ES is green, Mongo shows no errors, and all messages are stuck in the journal.
Graylog 2.4, Mongo 3.6, ES 5.6.12
I've solved the problem by removing graylog journal data. Messages started being indexed immediately.
I am closing this issue now. We didn't manage to find the root cause unfortunately, but it also doesn't make sense to keep it open. Please visit the discussion forum for help if you run into this again or create a new issue with updated information. (logs, metrics, etc.) Also please make sure to run the latest Graylog stable release.
Thank you!
We just hit the same issue today.
A Graylog node stopped processing messages. Meanwhile there were no issues with ES, and no issues with disk space, memory, or CPU load on the affected node. It just got stuck: the processing buffer was full, but the output buffer was empty.
So I disabled the affected node at the LB to keep the message queue from growing, and started googling.
I arrived here from https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719/10
and in the end found that the bug was closed without any resolution, just with advice to go back to the community.
So it looks like nobody is interested in a solution, which is sad.
One good thing in our case: after 40-60 minutes in the stuck state, the node started processing messages again.
This happens multiple times a day for us. The only thing that helps is deleting the journal. We are losing a couple of million messages a day to this, not to mention the countless hours of dev and ops time wasted on it.
This renders our Graylog cluster all but unusable. If no solution is found, we will probably have to move our org away from Graylog.
@EpiqSty @mzhaase I will re-open this. We need the following information to have a chance to debug this:
curl localhost:9000/api/cluster
curl localhost:9000/api/cluster/<node-id>/jvm (for every node in the cluster)
curl localhost:9000/api/cluster/<node-id>/threaddump (for every node in the cluster)
curl localhost:9000/api/system/metrics (for every node in the cluster)
curl localhost:9000/api/system/plugins
curl localhost:9000/api/system/inputs (collect all input IDs)
curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input ID)
curl localhost:9000/api/streams
curl localhost:9000/api/system/messageprocessors/config
curl localhost:9000/api/system/lookup/tables
curl localhost:9000/api/system/lookup/caches
curl localhost:9000/api/system/lookup/adapters
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule
curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections
curl localhost:9000/api/system/grok
Thank you!
Graylog 2.4.7
ES 5.6.14
I collected all the other information, but there is a lot of business logic and personally identifiable information in the multiple 100k lines. Can I send it somewhere via email instead of posting it here?
@mzhaase Feel free to send it to my-github-name at graylog.com.
Sent.
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
Hi @bernd. Is there any update on investigation? I have the same issue with Graylog 2.4.6+ceaa7e4 on graylog (Oracle Corporation 1.8.0_172 on Linux 4.4.0-1044-aws)
@sturman Nope, unfortunately not. I wasn't able to find anything suspicious in the provided information. :confused: This issue also fell through the cracks on my side because it was closed by our "no-response" bot for some reason. Sorry about that, @mzhaase :disappointed:
My only guess is that some broken message is stuck in the front of the journal and that's why message processing doesn't continue. (two people reported that deleting the journal helps) But without anything related to that in the logs I am unable to follow up on this to find the cause, unfortunately.
I think that @bernd is correct about something getting stuck in the journal and/or some extractor rule being overly intensive on a specific log message. I have observed this same problem on our Graylog cluster since adding the first GELF input. I have been able to get things rolling again with the following commands:
rm -rf /var/lib/graylog-server/journal/.lock
systemctl restart graylog-server
systemctl status graylog-server
Then it begins processing the journal logs and works for some time, and eventually hangs up again randomly. I am still researching the cause of this.
@on3iropolos @sturman @mzhaase @EpiqSty @mahnunchik @madiTG
We implemented a fix to handle problems with bulk requests that are too big. (This is the "broken pipe" error.) Previously, if Graylog tried a bulk index request against Elasticsearch that was too big (the http.max_content_length setting, default 100 MB), we would retry the request over and over again, which resulted in processing being stopped. This could happen if the batch size is high and the batch contained lots of big messages, or fewer really big ones. Also, the http.max_content_length setting is lower in some cloud environments like AWS.
This could be an explanation for the issues you folks saw.
The fix has been implemented in https://github.com/Graylog2/graylog2-server/pull/7071 and will be part of the upcoming 3.2.0 feature and 3.1.4 bug fix releases.
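Until the fix is available, a possible mitigation (a sketch; values are illustrative, and managed services such as AWS Elasticsearch may not let you raise the ES limit) is to keep each bulk request comfortably below http.max_content_length:

# elasticsearch.yml (self-managed clusters only); 100mb is the default limit
http.max_content_length: 100mb

# graylog server.conf: smaller batches keep each bulk request well under that limit
output_batch_size = 500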
@bernd - That's great news! I will implement 3.1.4 as soon as I am able and report back if the issue still occurs. Thanks!
@on3iropolos Awesome, thank you! :smiley:
@bernd - I have installed the 3.1.4 update. I will update if I experience an issue like the one before.
@bernd GJ
No clear answer to the problem; I'm facing the same issue here. Process buffers are filling up and there is no way to debug this. Any suggestions?
I have the suspicion that certain messages create an endless loop in the process buffer processor thus knocking that processor out. Once it happens to all processors, no more messages get processed.
My "solution" to this is to have a cronjob running a python script ever 5m that checks the journal utilization and restarts the server if it is above x. That works pretty well, actually.
I also suspect that, but my main question is how to identify the root problem: which message is causing the infinite loop? Is there some strategy in Graylog to ignore such messages? None of this is clear.
Same problem here: ES is green but GL is not writing at all.
Restarting the server temporarily resolves the problem.
@bernd Dear bernd, I am facing the same problem: 3000+ msg/s in, 0 msg/s out.
Graylog version: 3.3, one master, two slaves. ES version: 6.8, one master, two slaves. Mongo version: 3.6, three nodes. There are no error logs in Graylog/ES/Mongo, the ES indices are all green, and all shards are STARTED. When I restart ES/Mongo, or delete all journal messages in data/journal/messagejournal-0/* and restart Graylog, there is still 0 msg/s output. I am stuck and looking forward to your help as soon as possible.
Expected Behavior
I expect messages to be moved from the process buffer to the output buffer constantly.
Current Behavior
The process buffer fills up while the output buffer stays empty, and nothing is written to the database.
Possible Solution
Possibly a thread lock or something like that
Steps to Reproduce (for bugs)
I can reproduce this by putting a lot of messages into Graylog: it works for some time and then stops sending messages to the ES cluster. I have no idea how you can reproduce this.
Context
Pushing a large volume of messages, circa 10k/s, to Graylog.
Thread dump of affected node: Thread dump of node 6e52bb4b / graylog2 Taken at Wed Nov 08 2017 11:32:36 GMT+0100
Your Environment