Closed mrdavidlaing closed 10 years ago
It turns out that /app/var/log/app-lumberjack_redis.log
was littered with errors that were causing logstash to restart (and restarting uses a lot of CPU):
{:timestamp=>"2014-03-02T08:13:19.130000+0000", :message=>"Using milestone 2 output plugin 'redis'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.3.2/plugin-milestones", :level=>:warn}
Exception in thread "LogStash::Runner" org.jruby.exceptions.RaiseException: (ArgumentError) invalid byte sequence in UTF-8
at org.jruby.RubyRegexp.match(org/jruby/RubyRegexp.java:1699)
at RUBY.match(file:/app/vendor/logstash.jar!/grok-pure.rb:144)
at RUBY.decode(file:/app/vendor/logstash.jar!/logstash/codecs/multiline.rb:148)
at RUBY.run(file:/app/vendor/logstash.jar!/logstash/inputs/lumberjack.rb:46)
at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
at RUBY.data(file:/app/vendor/logstash.jar!/lumberjack/server.rb:241)
at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:223)
at Lumberjack::Parser.data_field_value(file:/app/vendor/logstash.jar!/lumberjack/server.rb:182)
at Lumberjack::Parser.feed(file:/app/vendor/logstash.jar!/lumberjack/server.rb:93)
at RUBY.compressed_payload(file:/app/vendor/logstash.jar!/lumberjack/server.rb:198)
at Lumberjack::Parser.feed(file:/app/vendor/logstash.jar!/lumberjack/server.rb:93)
at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:220)
at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:59)
rake aborted!
Removing the multiline codec stopped the errors, and reduced the CPU usage:
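For context, the problematic setup was a multiline codec nested inside the lumberjack input, and the fix was dropping it. A sketch of the before/after (port, certificate paths, and pattern here are illustrative, not the actual config from this deployment):

```
# Before: multiline handled as a codec inside the input
# (CPU-hungry, and crashed on invalid UTF-8 as in the trace above)
input {
  lumberjack {
    port => 5043                       # illustrative
    ssl_certificate => "/path/to.crt"  # illustrative
    ssl_key => "/path/to.key"          # illustrative
    codec => multiline {
      pattern => "^\s"                 # illustrative continuation pattern
      what => "previous"
    }
  }
}

# After: a plain lumberjack input, no multiline codec
input {
  lumberjack {
    port => 5043
    ssl_certificate => "/path/to.crt"
    ssl_key => "/path/to.key"
  }
}
```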
Viewing the logstash thread names helps to show what part of the logstash config is causing the bottleneck:
(Htop is configured to show thread names in the dump below)
CPU[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] Tasks: 50, 62 thr; 3 running
Mem[||||||||||||||||||||||| 546/3750MB] Load average: 1.85 0.90 0.58
Swp[ 0/0MB] Uptime: 05:29:01
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
922 ubuntu 20 0 37956 1304 1000 S 0.0 0.0 0:00.00 ├─ su - ubuntu -c cd /app/app; export PORT=5400; rake logstash:lumberjack_to_redis >> /app/var/log/lum
924 ubuntu 20 0 19520 1716 1408 S 0.0 0.0 0:00.00 │ └─ -su -c cd /app/app; export PORT=5400; rake logstash:lumberjack_to_redis >> /app/var/log/lumberja
928 ubuntu 20 0 72500 15176 2564 S 0.0 0.4 0:00.21 │ └─ ruby1.9.1 /usr/local/bin/rake logstash:lumberjack_to_redis
935 ubuntu 20 0 72500 15176 2564 S 0.0 0.4 0:00.00 │ ├─ ruby1.9.1
934 ubuntu 20 0 4404 608 512 S 0.0 0.0 0:00.00 │ └─ sh -c TMP=/app/tmp /usr/bin/java -jar '/app/vendor/logstash.jar' agent -w 1 -f '/app/app/s
936 ubuntu 20 0 1802M 215M 16220 S 86.0 5.7 2:13.31 │ └─ /usr/bin/java -jar /app/vendor/logstash.jar agent -w 1 -f /app/app/srv/logstash/config/
1202 ubuntu 20 0 1802M 215M 16220 R 15.0 5.7 0:15.10 │ ├─ <lumberjack
1201 ubuntu 20 0 1802M 215M 16220 S 15.0 5.7 0:15.21 │ ├─ <lumberjack
1001 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.00 │ ├─ >output
1000 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.19 │ ├─ >output
999 ubuntu 20 0 1802M 215M 16220 S 17.0 5.7 0:19.39 │ ├─ >output
998 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.03 │ ├─ LogStash::Runne
997 ubuntu 20 0 1802M 215M 16220 R 38.0 5.7 0:41.71 │ ├─ |worker
996 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.21 │ ├─ <lumberjack
995 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.60 │ ├─ <lumberjack
994 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:01.41 │ ├─ <lumberjack
993 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.30 │ ├─ <lumberjack
957 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:07.02 │ ├─ LogStash::Runne
956 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.00 │ ├─ java
949 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.22 │ ├─ java
947 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.27 │ ├─ java
945 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.30 │ ├─ java
944 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.00 │ ├─ java
943 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:09.17 │ ├─ java
942 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:09.08 │ ├─ java
941 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.00 │ ├─ java
940 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.21 │ ├─ java
939 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:00.06 │ ├─ java
938 ubuntu 20 0 1802M 215M 16220 S 2.0 5.7 0:05.29 │ ├─ java
937 ubuntu 20 0 1802M 215M 16220 S 0.0 5.7 0:07.36 │ └─ LogStash::Runne
Turns out that adding multiline back as a filter (rather than a codec) seems to work... but it still uses lots of CPU.
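For reference, the filter-based equivalent looks roughly like this (the pattern is illustrative; logstash 1.3/1.4 shipped a multiline filter alongside the codec):

```
filter {
  multiline {
    pattern => "^\s"    # illustrative: join indented lines onto the previous event
    what => "previous"
  }
}
```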
I think we must separate the ingestor and queue VMs so we can scale them separately.
Specifically, we should probably have 2 ingestor nodes and use logstash-forwarder's round-robin load balancing.
Thank you for researching this further and fixing it!
I think splitting the ingestor/queues may be premature, although it would probably be smart to separate the hostnames we're using, for future flexibility. I don't think it's currently our bottleneck, even under very high utilization. It can be a bottleneck, but I think there are bigger ones further down the line, like elasticsearch/disk throughput or logstash CPU usage.
For round-robin load-balancing, we'll need to consider how we do that dynamically.
I don't think it's currently our bottleneck
I'm pretty sure it is - every time there is a spike in load and the queue node's CPU maxes out, it's the lumberjack ingestor that is using all the CPU.
Digging into New Relic:
it seems quite clear to me that when we get a spike in network traffic, we get a spike in the CPU usage of java / logstash. Whenever I've manually run htop during a spike today, it's been the lumberjack ingestors using the CPU, as shown above.
Concerning round-robin.
To clarify, based on Friday's experience I meant: while the redis node may have been maxed out, the queue was still rapidly growing; and given the queue was functional, further down the line we weren't able to keep up with that queue, which is what I was considering to be the bigger bottleneck.
I see what you're saying. At any rate, in the short term I don't think we should do anything more than increase the queue instance size to give it more CPUs.
There doesn't seem to be much we can do to reduce the lumberjack ingestor CPU usage in the short term. Closing this as "won't fix" for now.
I'm running into this issue (I think). I just spun my ingestor (a logstash instance that simply takes Lumberjack and dumps it into Redis) out into a separate instance and Lumberjack is always using all of the CPU:
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.8%] Tasks: 31, 114 thr; 5 running
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.7%] Load average: 3.94 3.88 3.76
Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||342/1692MB] Uptime: 05:30:53
Swp[ 0/895MB]
PID USER PRI NI VIRT RES SHR S CPU% MEM% PID Command
15119 root 20 0 1867M 187M 13852 S 192. 11.1 15119 /usr/bin/java -Xmx500m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -jar /usr/local/liveview-shipping/logstash-1.4.0/v
15626 root 20 0 1867M 187M 13852 R 58.1 11.1 15626 <lumberjack
15633 root 20 0 1867M 187M 13852 R 42.5 11.1 15633 <lumberjack
15634 root 20 0 1867M 187M 13852 R 40.3 11.1 15634 <lumberjack
15159 root 20 0 1867M 187M 13852 R 39.9 11.1 15159 <lumberjack
I don't have a ton of traffic (I don't think): about 10 nodes are pushing around one event per second right now (total of 10 connections, 10 logs/second). Is this normal performance? Anything I can do to dig into those threads and see where they're spending their time? This is an EC2 c1.medium instance.
@ryno2019 - in our case the CPU gobbler turned out to be incorporating the multiline filter codec inside the lumberjack input.
When we moved that out into a regular multiline filter the CPU usage seemed to reduce.
What does your logstash config for the lumberjack ingestor look like?
@mrdavidlaing - Thanks for getting back! My input config has only the following:
lumberjack {
  port => 3333
  type => "tcpssl"
  ssl_certificate => "./logstash-agent.crt"
  ssl_key => "./logstash-agent.key"
  tags => ["lumber"]
  codec => "plain"
}
The log messages coming in are JSON blobs from a custom application we wrote.
@ryno2019 - Have you tried removing the tags and codec lines and seeing if those make any difference to CPU usage?
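That is, the input pared down to just:

```
lumberjack {
  port => 3333
  type => "tcpssl"
  ssl_certificate => "./logstash-agent.crt"
  ssl_key => "./logstash-agent.key"
}
```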
I'll try that now. I restarted Logstash without those lines and the utilization looks much better, but sometimes it seems to ramp back up over time, so I'll let you know what happens throughout the day.
Isn't "plain" the default codec? And if so, any idea why specifying it makes any difference versus leaving the default? Also with "tags": why could adding a simple tag eat so much CPU?
Thanks again for your help!
It's looking quite stable now, although I did run into one issue: I switched a bunch of my reporting agents all at once from one of my "ingestor" servers to the other and the CPU (on the Logstash server) went right to 100% and seemed to stay there. Rebooting Logstash brought it back down and it looks stable now, but do you have any idea what might cause that?
Is there any easy way to debug the Lumberjack input to check for queue sizes and see where it's spending most of its time?
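One generic way to dig in (not lumberjack-specific) is to map the per-thread PIDs that htop shows onto JVM stack dumps via jstack. A sketch, assuming a JDK is installed and using thread 15626 from the dump above as the example:

```shell
# Per-thread CPU with thread names (what htop shows above), for the JVM pid:
#   top -H -b -n 1 -p <jvm_pid>
# jstack reports native thread ids in hex ("nid=0x..."), so convert the
# hot thread's PID from top/htop to hex to find its stack in the dump:
printf 'nid=0x%x\n' 15626    # prints nid=0x3d0a
#   jstack <jvm_pid> | grep -A 20 'nid=0x3d0a'
```

Repeating the jstack a few times and seeing the same frames at the top of the hot thread's stack is usually enough to tell where the time is going.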
@ryno2019 - I'm afraid that you have just become as much of an expert as me. If you find out how to get better perf info for the lumberjack input; please let me know!
Thanks! I let it run for a while and again, at some point, for no obvious reason, the CPU jumped back to 100% and stayed there. I'll see if I can dig in a little more to see what Lumberjack is spending its time on...
@ryno2019 - are you seeing any "crash" style errors in the logstash logs? I know that when logstash starts up it uses 100% CPU (I presume because JRuby is compiling stuff); so if it's crashing regularly, that might be the cause of the CPU usage.
@mrdavidlaing - Hmm, no. Nothing in the Logstash logs. I started it yesterday afternoon, and at 2:00am this morning Lumberjack started using 100% of one core. It's continued to run without spitting out any logs whatsoever. :/ I'll try upping the verbosity to see if I can get any insight...
Since rolling out multiline support for lumberjack messages - #351 - the redis node where the lumberjack ingestor runs has been using 100% CPU.
Figure out why. Stop it.