Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Usual input load around 7000 msg/s - output works intermittently #1839

Closed: thebox-dev closed this issue 7 years ago

thebox-dev commented 8 years ago

Problem description

We have the following setup in AWS: 3 x Graylog server instances started from the official Graylog AMI, version 1.3.3.

During the night, the usual throughput on the input is around 6000 msg/s and pretty constant.

Steps to reproduce the problem

We noticed the issue only yesterday, two weeks after we installed the system. During the day, under higher load of around 8000 msg/s, we noticed the journal utilization increasing steadily while the messages were not getting consumed at a constant rate. There are large pauses in the output: complete stops of 20-25 seconds at 0 msg/s, then output rates of 12,000-15,000 msg/s for about 10 seconds, followed again by a complete stop for another 20-25 seconds. The behavior has been pretty stable ever since we noticed it. During the night, when the traffic returns to the usual values, the behavior is similar but the journal utilization gets much lower, down to 2-3%.

We increased the Java heap size on the instances and the number of file descriptors, and we checked the CPU and memory and saw nothing out of the ordinary. The network connectivity looks fine, and the disk I/O is good since we are only using SSDs (2 TB, ~6000 IOPS on the Elasticsearch instance). Everything looks right to me, but those long pauses of over 20 seconds, with CPUs down to 2-3%, drive me nuts.

Please let me know if it sounds familiar to you and help me troubleshoot this issue. Many thanks!

Environment

the-Pretender commented 8 years ago

Noticed similar behaviour on our testing system. As it was under heavy load on both disk (not SSD) and CPU while testing EPS, I initially thought disk problems were causing the hiccups. But since others are experiencing the same without those signs, maybe there is an underlying issue. In my scenario I was testing parsing of Checkpoint logs at 8k msg/s, which worked fine for ~20 minutes before the journal started filling up. I then turned down the msg/s by 1k at a time until it stabilized at 5k.

kroepke commented 8 years ago

Have you enabled the garbage collection log for Elasticsearch? This sounds like a GC problem, especially because of the intermittent pauses.
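
For reference, a minimal sketch of how GC logging might be enabled for an Elasticsearch node of that era, assuming a JDK 8 JVM and that extra JVM options are passed via the ES_JAVA_OPTS environment variable; the mechanism and the log file path below are assumptions that depend on the installation:

    # Standard HotSpot (JDK 8) GC logging flags passed to the Elasticsearch JVM,
    # e.g. via ES_JAVA_OPTS in /etc/default/elasticsearch (path is an assumption).
    ES_JAVA_OPTS="-verbose:gc -Xloggc:/var/log/elasticsearch/gc.log \
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"

Long stop-the-world pauses in that log which line up with the 20-25 second output stalls would point at heap pressure on the Elasticsearch node.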

thebox-dev commented 8 years ago

Apparently it is not related to Elasticsearch. It looks like the processing buffers on the GL server instance were full, so I added more processors in the config and also doubled the ring size. But the thing that actually fixed the issue was removing the 14 default extractors that came with GL. Now GL runs fine with the mentioned load and the journal stays at 1-2% at most.
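
For anyone hitting the same symptom, a sketch of the kind of graylog.conf settings involved, assuming the stock option names; the values below are illustrative, not the ones used above:

    # Illustrative graylog.conf excerpt (example values only)
    processbuffer_processors = 5     # threads that run extractors/message processing
    outputbuffer_processors = 3      # threads that write batches to Elasticsearch
    ring_size = 65536                # buffer ring size, must be a power of two

Since extractors run on the process buffer threads for every incoming message, expensive ones (e.g. backtracking regular expressions) can stall processing in exactly the way described, which fits the fix of removing the 14 default extractors.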

thebox-dev commented 8 years ago

I have one question though: can the web interface timeout parameter take values higher than 60s? I get these errors in the logs when doing heavy searches, even though I explicitly increased the timeout above 60s:

2016-02-18 21:55:59,947 - [ERROR] - from org.graylog2.restclient.lib.ApiClient in play-akka.actor.default-dispatcher-84
REST call failed
java.util.concurrent.TimeoutException: Read timeout to /172.31.7.18:12900 of 60000 ms
        at com.ning.http.client.providers.netty.request.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:47) ~[com.ning.async-http-client-1.9.31.jar:na]
        at com.ning.http.client.providers.netty.request.timeout.ReadTimeoutTimerTask.run(ReadTimeoutTimerTask.java:57) ~[com.ning.async-http-client-1.9.31.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:556) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:632) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:369) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[io.netty.netty-3.10.4.Final.jar:na]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66]
joschi commented 8 years ago

@thebox-dev Unfortunately this specific timeout is hard coded in Graylog 1.x.

Related source code: https://github.com/Graylog2/graylog2-server/blob/1.3.3/graylog2-rest-client/src/main/java/org/graylog2/restclient/models/UniversalSearch.java#L111-L115

thebox-dev commented 8 years ago

Actually, I found out that it can be adjusted:

https://github.com/Graylog2/graylog2-web-interface/issues/1679

But it does not seem to accept values higher than 60s.

joschi commented 8 years ago

Actually, I found out that it can be adjusted

No, you didn't. That specific timeout is hard coded. Feel free to check out the source code I've linked to in my last reply.

thebox-dev commented 8 years ago

I see... thanks for clarifying.

Is there a way to get around this currently? The network communication between all three components seems fine and the GL server instance seems OK on resources, but the ES instance goes to 99% CPU when doing searches for specific log entries. I think the GL server waits on ES to provide the search results (I adjusted the elasticsearch_timeout to 3 minutes) and the web interface instance simply times out when the wait is longer than 1 minute.

joschi commented 8 years ago

Is there a way to get around this currently?

Not without forking Graylog and making that timeout user-configurable.
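
To make the two timeouts explicit, a sketch using the option name as written by the reporter above for the server side (not a verified excerpt of the shipped config):

    # graylog.conf on the server instance: wait up to 3 minutes for Elasticsearch
    elasticsearch_timeout = 3m
    # The web interface's REST client read timeout stays fixed at 60000 ms in 1.x,
    # which is the limit visible in the TimeoutException above.

So raising the server-side value alone cannot stop the web interface from timing out first.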

jalogisch commented 7 years ago

@thebox-dev Please update to the latest version. Your issue should be fixed there.