elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Logstash 1:1.5.0-1 dies after some time #3276

Closed parabolic closed 7 years ago

parabolic commented 9 years ago

Hi, today I've upgraded to logstash version 1:1.5.0-1 on ubuntu 14.04, x86_64, and it seems to hang after some time without any notification in the logs whatsoever. When I try to stop the process by issuing

service logstash stop

or

/etc/init.d/logstash stop

I get the following:

killing logstash (pid 26022) with SIGTERM
Waiting logstash (pid 26022) to die...
Waiting logstash (pid 26022) to die...
Waiting logstash (pid 26022) to die...
Waiting logstash (pid 26022) to die...
Waiting logstash (pid 26022) to die...
logstash stop failed; still running.
logstash started.

and then I have to do a

kill -9 

to kill the process, and it goes on and on. I started logstash manually with the debug option and after some time it died again, but I got the following lines at the end:

@metadata_accessors=#<LogStash::Util::Accessors:0x2513a1f1 @store={"retry_count"=>0}, @lut={}>, @cancelled=false>]]}, :batch_timeout=>1, :force=>true, :final=>nil, :level=>:debug, :file=>"stud/buffer.rb", :line=>"207", :method=>"buffer_flush"}
Sending bulk of actions to client[0]: localhost {:level=>:debug, :file=>"logstash/outputs/elasticsearch.rb", :line=>"461", :method=>"flush"}
Shifting current elasticsearch client {:level=>:debug, :file=>"logstash/outputs/elasticsearch.rb", :line=>"468", :method=>"flush"}
Switched current elasticsearch client to #0 at localhost {:level=>:debug, :file=>"logstash/outputs/elasticsearch.rb", :line=>"518", :method=>"shift_client"}

The debug output stops at this point and the process hangs.
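Editor's aside: the stop sequence above (SIGTERM, repeated waiting, then a manual kill -9) can be sketched in a few lines of Ruby. A throwaway sleep child stands in for the hung Logstash process (unlike the real bug, it actually dies on TERM); the escalation logic is the point, not the child.

```ruby
# Sketch of the TERM-then-KILL escalation the reporter performs by hand.
# A throwaway 'sleep' child stands in for the hung Logstash process.
pid = Process.spawn("sleep", "300")
Process.kill("TERM", pid)

status = nil
5.times do
  reaped = Process.waitpid2(pid, Process::WNOHANG)  # non-blocking check
  (status = reaped[1]; break) if reaped
  sleep 1                                           # "Waiting ... to die..."
end

unless status                    # grace period exhausted, still running
  Process.kill("KILL", pid)      # the reporter's manual kill -9
  status = Process.waitpid2(pid)[1]
end
puts(status.signaled? ? "killed by signal #{status.termsig}" : "exited")
```

For a process that ignores or cannot service SIGTERM (as the hung Logstash here does), only the KILL branch ever fires.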

I have a lot of logstash "senders" which just use lumberjack to send encrypted logs to the main node; they are upgraded to 1.5 and they work fine. It seems that the "master" logstash with the elasticsearch output has a problem somewhere which is causing it to hang. Here's the output config:

elasticsearch {
      host => "localhost"
      protocol => "http"
    }

At the moment I've reverted back to logstash 1.4.2. Thanks and regards.

surlypants commented 9 years ago

and here's a gstack (amongst other things)

https://gist.github.com/surlypants/0b4482a37573a0907ec0

surlypants commented 9 years ago

It's croaked again. The final entry to ES is/was:

{ "_index": "logstash-2015.06.02", "_type": "syslog-ng", "_id": "AU215SPk-gN6IiV6gG8A", "_score": 1, "_source": { "message": "ameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123", "@version": "1", "@timestamp": "2015-06-02T20:10:01.593Z", "type": "syslog-ng", "host": "3PAR_1648417", "path": "/var/log/syslog-ng/3PAR_1648417/2015/06/20150602.log", "syslog_host": "vp-logstash02", "tags": [ "3par", "_grokparsefailure" ] }, "fields": { "@timestamp": [ 1433275801593 ] } }

examining the raw syslog in question, however, shows:

$ grep 'ameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123' /var/log/syslog-ng/3PAR_1648417/2015/06/20150602.log

2015-06-02T00:28:29-03:00 3PAR_1648417 dskerr hw_disk:50050CC1020582FA target SAS 50050CC1020582FA on 1:0:1: cmdstat:0x08 (TE_INVALID -- Invalid input parameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123

red-herring? or useful?

surlypants commented 9 years ago

perhaps not at all a red-herring... after a forced restart, it croaked again after the same message...

[edit] again, choked on same syslog entry (and continues to incorrectly ingest the raw data)

gist'd for clarity: https://gist.github.com/surlypants/3b441c1d04469036bdc9

[edit]

seems that LS is going out to lunch as soon as it sees _grokparsefailure.

To try to debug why LS is not interpreting the underlying syslog message, I added this clause at the top of the appropriate filter (because it's seemingly only this one host that has suddenly begun having this problem):

    if [host] == "3PAR_1648417" {
        mutate {
            add_field => { "debug_message" => "%{message}" }
        }
    }

debug_message => nd: getsralertcrit all Error: no criteria listed} {}

while the underlying syslog message is:

2015-06-02T23:59:07-03:00 3PAR_1648417 cli_cmd_err sw_cli {3paradm super all {{0 8}} -1 10.0.2.66 306} {Command: getsralertcrit all Error: no criteria listed} {}

I'm so confused...
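Editor's aside on the truncated debug_message: it is exactly a suffix of the raw syslog line, which suggests the line was split somewhere upstream before the filter ever saw it. A quick check in Ruby, with both strings copied from the entries above:

```ruby
# The raw syslog line (from the file) and the fragment that ended up in
# debug_message; the fragment turns out to be a clean suffix of the line.
full = '2015-06-02T23:59:07-03:00 3PAR_1648417 cli_cmd_err sw_cli ' \
       '{3paradm super all {{0 8}} -1 10.0.2.66 306} ' \
       '{Command: getsralertcrit all Error: no criteria listed} {}'
fragment = 'nd: getsralertcrit all Error: no criteria listed} {}'

puts full.end_with?(fragment)       # prints "true": a clean suffix
puts full.length - fragment.length  # how many leading bytes were lost
```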

ph commented 9 years ago

Can you give me a sample log that will reproduce your issue? That would be great; I'll take a look later.

surlypants commented 9 years ago

you've got mail.

driskell commented 9 years ago

I've checked your stack trace @Centopus and it looks exactly like my #3361, where there's a PR linked you could try. The issue is a deadlock between the single worker, blocked on a mutex sync inside a WHEN node, and the pipeline thread, blocked on a mutex sync in a queue push. I'm happy to look through others' stack traces to help find matches for #3361; so far just yours in this issue, I think.
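Editor's aside: the deadlock shape described above can be sketched in plain Ruby (names and the bounded-queue size are illustrative, not Logstash's actual internals). A "flusher" holds a lock while pushing into a full bounded queue, while the only consumer needs that same lock before it will pop.

```ruby
# Illustrative deadlock: lock held across a blocking push into a full
# bounded queue, while the consumer needs the lock to drain the queue.
queue = SizedQueue.new(1)
lock  = Mutex.new
queue.push(:event)                # the queue is now full

flusher = Thread.new do
  lock.synchronize do             # holds the lock...
    queue.push(:flush_event)      # ...and blocks forever: queue is full
  end
end

consumer = Thread.new do
  sleep 0.1                       # let the flusher take the lock first
  lock.synchronize { queue.pop }  # blocks: the lock is never released
end

# Neither thread finishes within the timeout: a classic deadlock.
deadlocked = flusher.join(0.5).nil? && consumer.join(0.5).nil?
puts(deadlocked ? "deadlocked" : "completed")
flusher.kill
consumer.kill
```

The fix direction in such cases is to never hold the lock across the blocking push, e.g. push outside the synchronized section or use a timed/non-blocking push.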

@surlypants if you send a QUIT signal to logstash (kill -QUIT pidoflogstash), Java will catch it and emit a stack trace at the Java level. Unfortunately gstack is too low level, at the runtime level, and all JRuby meaning is lost. Seems jstack is unreliable too (QUIT does pretty much the same but prints to stdout).
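Editor's aside on what such a dump buys you: a plain-Ruby approximation below lists each live thread's status and top stack frames, which is the information that lets you spot two threads parked on each other's locks. (Under JRuby, SIGQUIT asks the JVM for the real Java-level dump; this sketch only mimics the idea.)

```ruby
# Plain-Ruby approximation of a thread dump: each live thread's status
# plus its top stack frames. A parked thread stands in for a blocked
# pipeline worker.
parked = Thread.new { sleep }   # permanently parked thread
sleep 0.1                       # give it time to park

dump = Thread.list.map do |t|
  frames = (t.backtrace || []).first(2).map { |f| "    #{f}" }
  ["  thread status=#{t.status}", *frames].join("\n")
end.join("\n")

puts dump                       # the parked thread shows status=sleep
parked.kill
```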

surlypants commented 9 years ago

@driskell no luck:

$ ps ax | grep logstash | grep java
7095 ?  SNl  24:28 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6

$ kill -QUIT 7095

$ ps ax | grep logstash | grep java
7095 ?  SNl  24:29 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6

driskell commented 9 years ago

@surlypants The trace will appear in the log file logstash is outputting to. The QUIT signal won't kill the process; it will just print its stack trace to the log. If you can put it in a gist it will be easier to read.

Centopus commented 9 years ago

Hello again. I've been trying to induce the hangup by different types of log rotation. No luck.

I'm currently out of ideas. I'm following the topic, just message me if I can help with anything else.

driskell commented 9 years ago

@Centopus your stack trace looked exactly like #3361, which you'll likely find almost impossible to reproduce on demand. Hopefully the fix will be merged soon.

jsvd commented 9 years ago

I'm able to reproduce this consistently on master after inserting some sleeps in particular places, and then running logstash:

logstash-1.5.0 % bin/logstash -e 'input { generator { count => 25 message => "1"} } filter { clone { clones => ["a"]} }'
pushing flush_event
Logstash startup completed
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing flush_eventpushing flush_event

^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}

driskell commented 9 years ago

@jsvd Do you have a stack trace? Can you send QUIT signal and provide it in a gist? If it is #3361 there's a PR to fix it you could try: #3363

jsvd commented 9 years ago

@driskell :facepalm: I wanted to post this in #3361; it's the scenario of flush -> input queue fills -> flusher thread triggers deadlock.

jsvd commented 9 years ago

I'm checking your PR now.

vankhoa011 commented 9 years ago

This is my log of logstash when logstash died.

{:timestamp=>"2015-06-05T10:50:02.123000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
{:timestamp=>"2015-06-05T10:50:02.132000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
celesteking commented 9 years ago

What about netstat -pnt | grep java while it's stuck? Only one person has provided that so far.

surlypants commented 9 years ago

After much wailing and gnashing of teeth, I believe my problem was ultimately related to:

https://goo.gl/XDd4kH

After inserting a boatload more "security-casing", my issue has subsided.

ph commented 9 years ago

We have merged this PR https://github.com/elastic/logstash/issues/3361 into the master branch; it should fix hangs related to the multiline filter or any other plugins that periodically flush their buffers. If you are experiencing this issue, try the PR. We should release a 1.5.1 version soon containing the fix.

apple-corps commented 9 years ago

I'm using 1.5.1 and have the same issue:

sudo service logstash stop
Killing logstash (pid 10409) with SIGTERM
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
logstash stop failed; still running.

emoperations commented 9 years ago

Same issue here with 1.5.4. Logstash takes more than 100% CPU and crashes after some time.

ph commented 7 years ago

I think we can close this issue; it's from LS 1.5.1 and a lot of logic has changed in the related code. If it's still an issue with a newer LS version, please open a new issue.