cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

lumberjack ingestor/endpoint - using 100% CPU #353

Closed mrdavidlaing closed 10 years ago

mrdavidlaing commented 10 years ago

Since rolling out multiline support for lumberjack messages - #351 - the redis node where the lumberjack ingestor runs has been using 100% CPU.

Figure out why. Stop it.

mrdavidlaing commented 10 years ago

It turns out that /app/var/log/app-lumberjack_redis.log was littered with errors that were causing logstash to restart (and restarting uses lots of CPU):

{:timestamp=>"2014-03-02T08:13:19.130000+0000", :message=>"Using milestone 2 output plugin 'redis'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.3.2/plugin-milestones", :level=>:warn}
Exception in thread "LogStash::Runner" org.jruby.exceptions.RaiseException: (ArgumentError) invalid byte sequence in UTF-8
    at org.jruby.RubyRegexp.match(org/jruby/RubyRegexp.java:1699)
    at RUBY.match(file:/app/vendor/logstash.jar!/grok-pure.rb:144)
    at RUBY.decode(file:/app/vendor/logstash.jar!/logstash/codecs/multiline.rb:148)
    at RUBY.run(file:/app/vendor/logstash.jar!/logstash/inputs/lumberjack.rb:46)
    at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
    at RUBY.data(file:/app/vendor/logstash.jar!/lumberjack/server.rb:241)
    at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:223)
    at Lumberjack::Parser.data_field_value(file:/app/vendor/logstash.jar!/lumberjack/server.rb:182)
    at Lumberjack::Parser.feed(file:/app/vendor/logstash.jar!/lumberjack/server.rb:93)
    at RUBY.compressed_payload(file:/app/vendor/logstash.jar!/lumberjack/server.rb:198)
    at Lumberjack::Parser.feed(file:/app/vendor/logstash.jar!/lumberjack/server.rb:93)
    at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:220)
    at RUBY.run(file:/app/vendor/logstash.jar!/lumberjack/server.rb:59)
rake aborted!

Removing the multiline codec stopped the errors and reduced the CPU usage.
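
Roughly, the before/after looked like this (a sketch, not our exact config; the port, SSL paths and multiline pattern below are placeholders):

    # Before: multiline codec attached to the lumberjack input. The codec's
    # regex match runs inside the lumberjack server thread, and an
    # "invalid byte sequence in UTF-8" there kills the whole pipeline.
    input {
      lumberjack {
        port => 5400                              # placeholder; adjust to the real ingestor port
        ssl_certificate => "/path/to/ssl.crt"     # placeholder
        ssl_key => "/path/to/ssl.key"             # placeholder
        codec => multiline {
          pattern => "^\s"                        # illustrative pattern only
          what => "previous"
        }
      }
    }

    # After: plain lumberjack input, no multiline codec.
    input {
      lumberjack {
        port => 5400
        ssl_certificate => "/path/to/ssl.crt"
        ssl_key => "/path/to/ssl.key"
      }
    }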

mrdavidlaing commented 10 years ago

Viewing the logstash thread names helps show which part of the logstash config is causing the bottleneck:

(Htop is configured to show thread names in the dump below)

CPU[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 50, 62 thr; 3 running
  Mem[|||||||||||||||||||||||                                       546/3750MB]     Load average: 1.85 0.90 0.58 
  Swp[                                                                   0/0MB]     Uptime: 05:29:01

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  922 ubuntu     20   0 37956  1304  1000 S  0.0  0.0  0:00.00 ├─ su - ubuntu -c cd /app/app; export PORT=5400; rake logstash:lumberjack_to_redis >> /app/var/log/lum
  924 ubuntu     20   0 19520  1716  1408 S  0.0  0.0  0:00.00 │  └─ -su -c cd /app/app; export PORT=5400; rake logstash:lumberjack_to_redis >> /app/var/log/lumberja
  928 ubuntu     20   0 72500 15176  2564 S  0.0  0.4  0:00.21 │     └─ ruby1.9.1 /usr/local/bin/rake logstash:lumberjack_to_redis
  935 ubuntu     20   0 72500 15176  2564 S  0.0  0.4  0:00.00 │        ├─ ruby1.9.1
  934 ubuntu     20   0  4404   608   512 S  0.0  0.0  0:00.00 │        └─ sh -c TMP=/app/tmp /usr/bin/java -jar '/app/vendor/logstash.jar' agent -w 1 -f '/app/app/s
  936 ubuntu     20   0 1802M  215M 16220 S 86.0  5.7  2:13.31 │           └─ /usr/bin/java -jar /app/vendor/logstash.jar agent -w 1 -f /app/app/srv/logstash/config/
 1202 ubuntu     20   0 1802M  215M 16220 R 15.0  5.7  0:15.10 │              ├─ <lumberjack
 1201 ubuntu     20   0 1802M  215M 16220 S 15.0  5.7  0:15.21 │              ├─ <lumberjack
 1001 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.00 │              ├─ >output
 1000 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.19 │              ├─ >output
  999 ubuntu     20   0 1802M  215M 16220 S 17.0  5.7  0:19.39 │              ├─ >output
  998 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.03 │              ├─ LogStash::Runne
  997 ubuntu     20   0 1802M  215M 16220 R 38.0  5.7  0:41.71 │              ├─ |worker
  996 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.21 │              ├─ <lumberjack
  995 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.60 │              ├─ <lumberjack
  994 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:01.41 │              ├─ <lumberjack
  993 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.30 │              ├─ <lumberjack
  957 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:07.02 │              ├─ LogStash::Runne
  956 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.00 │              ├─ java
  949 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.22 │              ├─ java
  947 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.27 │              ├─ java
  945 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.30 │              ├─ java
  944 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.00 │              ├─ java
  943 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:09.17 │              ├─ java
  942 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:09.08 │              ├─ java
  941 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.00 │              ├─ java
  940 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.21 │              ├─ java
  939 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:00.06 │              ├─ java
  938 ubuntu     20   0 1802M  215M 16220 S  2.0  5.7  0:05.29 │              ├─ java
  937 ubuntu     20   0 1802M  215M 16220 S  0.0  5.7  0:07.36 │              └─ LogStash::Runne
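
For reference, the same per-thread view can be pulled without htop; a rough sketch using the JVM PID from the dump above:

    # Per-thread CPU for the logstash JVM (PID 936 above). Logstash 1.x names
    # its native threads after pipeline stages: "<lumberjack" = input threads,
    # "|worker" = filter worker, ">output" = output threads.
    ps -L -p 936 -o tid,pcpu,comm

    # or interactively:
    top -H -p 936
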
mrdavidlaing commented 10 years ago

It turns out that adding multiline back as a filter (rather than a codec) seems to work... but it still uses a lot of CPU.
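
A rough sketch of that change, i.e. moving multiline out of the input and into the filter block (the pattern and stream_identity below are illustrative, not our exact settings):

    filter {
      multiline {
        pattern => "^\s"                          # illustrative: join indented continuation lines
        what => "previous"
        stream_identity => "%{host}.%{type}"      # keep events from different shippers separate
      }
    }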

mrdavidlaing commented 10 years ago

I think we must separate the ingestor and queue VMs so we can scale them separately.

Specifically, we should probably have 2 ingestor nodes and utilize logstash-forwarder's round-robin load balancing.

dpb587 commented 10 years ago

Thank you for researching this further and fixing it!

I think splitting the ingestor/queues may be premature, although it probably would be very smart to separate the hostnames we're using for future use. I don't think it's currently our bottleneck, even when under very high utilization. It can be a bottleneck, but I think there are bigger ones further down the line, like elasticsearch/disk throughput or logstash CPU usage.

For round-robin load-balancing, we'll need to consider how we do that dynamically.

mrdavidlaing commented 10 years ago

> I don't think it's currently our bottleneck

I'm pretty sure it is - every time there is a spike in load and the queue node's CPU maxes out, it's the lumberjack ingestor that is using all the CPU.

Digging into New Relic:

(New Relic screenshots: network traffic and CPU usage graphs)

it seems quite clear to me that when we get a spike in network traffic, we get a spike in the CPU usage of java / logstash. Whenever I've manually run htop during a spike today, it's been the lumberjack ingestors using the CPU, as shown above.

mrdavidlaing commented 10 years ago

Concerning round-robin:

  1. We must switch to using EIPs so I can get them whitelisted, and stop using the SSH tunnel. See https://cityindex.kanbantool.com/boards/78803#tasks-3147601
  2. We'll have to test logstash-forwarder with round-robin DNS, and see how it handles the failure case.
  3. Failing round-robin DNS, we could map ingestor1.cityindex.logsearch.io to EIP1 and ingestor2.cityindex.logsearch.io to EIP2, and put those hostnames in the logstash-forwarder config rather than the EIPs (see the sketch below).
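
A rough sketch of what (3) might look like on the shipper side, in the logstash-forwarder config (the port and file paths are placeholders; as I understand it, logstash-forwarder picks one entry from "servers" at random and reconnects to another on failure):

    {
      "network": {
        "servers": [
          "ingestor1.cityindex.logsearch.io:5043",
          "ingestor2.cityindex.logsearch.io:5043"
        ],
        "ssl ca": "/path/to/logstash-forwarder.crt",
        "timeout": 15
      },
      "files": [
        { "paths": [ "/var/log/app/*.log" ], "fields": { "type": "app" } }
      ]
    }
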
dpb587 commented 10 years ago

To clarify, based on Friday's experience, what I meant was: while the redis node may have been maxed out, the queue was still rapidly growing; and, given the queue was functional, further down the line we weren't able to keep up with that queue, and that's what I was considering to be the bigger bottleneck.

mrdavidlaing commented 10 years ago

I see what you're saying. At any rate, in the short term I don't think we should do anything more than increase the queue instance size to give it more CPUs.

mrdavidlaing commented 10 years ago

There doesn't seem to be much we can do to reduce the lumberjack ingestor CPU usage in the short term. Closing this as "won't fix" for now.

ryanbrink commented 10 years ago

I'm running into this issue (I think). I just spun my ingestor (a logstash instance that simply takes Lumberjack and dumps it into Redis) out into a separate instance and Lumberjack is always using all of the CPU:

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||97.8%]     Tasks: 31, 114 thr; 5 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.7%]     Load average: 3.94 3.88 3.76
  Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||342/1692MB]     Uptime: 05:30:53
  Swp[                                                                                                                       0/895MB]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   PID Command
15119 root       20   0 1867M  187M 13852 S 192. 11.1 15119 /usr/bin/java -Xmx500m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -jar /usr/local/liveview-shipping/logstash-1.4.0/v
15626 root       20   0 1867M  187M 13852 R 58.1 11.1 15626 <lumberjack
15633 root       20   0 1867M  187M 13852 R 42.5 11.1 15633 <lumberjack
15634 root       20   0 1867M  187M 13852 R 40.3 11.1 15634 <lumberjack
15159 root       20   0 1867M  187M 13852 R 39.9 11.1 15159 <lumberjack

I don't have a ton of traffic (I don't think): about 10 nodes are pushing around one event per second right now (total of 10 connections, 10 logs/second). Is this normal performance? Anything I can do to dig into those threads and see where they're spending their time? This is an EC2 c1.medium instance.

mrdavidlaing commented 10 years ago

@ryno2019 - in our case the CPU gobbler turned out to be using the multiline codec inside the lumberjack input.

When we moved that out into a regular multiline filter, the CPU usage seemed to drop.

What does your logstash config for the lumberjack ingestor look like?

ryanbrink commented 10 years ago

@mrdavidlaing - Thanks for getting back! My input config has only the following:

lumberjack {
    port => 3333
    type => "tcpssl"
    ssl_certificate => "./logstash-agent.crt"
    ssl_key => "./logstash-agent.key"
    tags => ["lumber"]
    codec => "plain"
}

The log messages coming in are JSON blobs from a custom application we wrote.

mrdavidlaing commented 10 years ago

@ryno2019 - Have you tried removing the tags and codec lines, and seeing if those make any difference to CPU usage?

ryanbrink commented 10 years ago

I'll try that now. I restarted Logstash without those lines and the utilization looks much better, but sometimes it seems to ramp back up over time, so I'll let you know what happens throughout the day.

Isn't "plain" the default codec? And if so, any idea why specifying it makes any difference versus leaving the default? Also with "tags": why could adding a simple tag eat so much CPU?

Thanks again for your help!

ryanbrink commented 10 years ago

It's looking quite stable now, although I did run into one issue: I switched a bunch of my reporting agents all at once from one of my "ingestor" servers to the other and the CPU (on the Logstash server) went right to 100% and seemed to stay there. Rebooting Logstash brought it back down and it looks stable now, but do you have any idea what might cause that?

Is there any easy way to debug the Lumberjack input to check queue sizes and see where it's spending most of its time?

mrdavidlaing commented 10 years ago

@ryno2019 - I'm afraid you have just become as much of an expert as I am. If you find out how to get better perf info for the lumberjack input, please let me know!
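
One hedged option for digging in, assuming a JDK (for jstack) is available on the box and using the JVM PID from your dump above:

    # Dump all JVM/JRuby thread stacks; taking a few dumps a few seconds apart
    # shows where the "<lumberjack" threads are actually spending their time.
    sudo jstack -l 15119 > /tmp/logstash-threads-$(date +%s).txt

    # Without a JDK, SIGQUIT makes the JVM print the same thread dump to its stdout/stderr:
    sudo kill -QUIT 15119
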

ryanbrink commented 10 years ago

Thanks! I let it run for a while and, again, at some point (for no obvious reason) the CPU jumped back to 100% and stayed there. I'll see if I can dig in a little more to see what Lumberjack is spending its time on...

mrdavidlaing commented 10 years ago

@ryno2019 - are you seeing any "crash" style errors in the logstash logs? I know that when logstash starts up it uses 100% CPU (I presume because JRuby is compiling stuff), so if it's crashing regularly then that might be the cause of the CPU usage.
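
A quick way to check whether the process is bouncing (the log path below is a guess; point it at wherever your logstash writes its log):

    # If the JVM's start time keeps changing, logstash is restarting behind your back:
    ps -C java -o pid,lstart,etime,args

    # Look for exception chatter around the times the CPU spikes:
    grep -iE "exception|error" /path/to/logstash.log | tail -n 20
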

ryanbrink commented 10 years ago

@mrdavidlaing - Hmm, no. Nothing in the Logstash logs. I started it yesterday afternoon, and at 2:00am this morning Lumberjack started using 100% of one core. It's continued to run without spitting out any logs whatsoever. :/ I'll try upping the verbosity to see if I can get any insight...