and here's a gstack (amongst other things)
It's croaked again. The final entry to ES is/was:
{
  "_index": "logstash-2015.06.02",
  "_type": "syslog-ng",
  "_id": "AU215SPk-gN6IiV6gG8A",
  "_score": 1,
  "_source": {
    "message": "ameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123",
    "@version": "1",
    "@timestamp": "2015-06-02T20:10:01.593Z",
    "type": "syslog-ng",
    "host": "3PAR_1648417",
    "path": "/var/log/syslog-ng/3PAR_1648417/2015/06/20150602.log",
    "syslog_host": "vp-logstash02",
    "tags": [ "3par", "_grokparsefailure" ]
  },
  "fields": { "@timestamp": [ 1433275801593 ] }
}
examining the raw syslog in question, however, shows:
$ grep 'ameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123' /var/log/syslog-ng/3PAR_1648417/2015/06/20150602.log
2015-06-02T00:28:29-03:00 3PAR_1648417 dskerr hw_disk:50050CC1020582FA target SAS 50050CC1020582FA on 1:0:1: cmdstat:0x08 (TE_INVALID -- Invalid input parameter), scsistat:0x02 (Check condition), snskey:0x05 (Illegal request), asc/ascq:0x26/0x0 (Invalid field in parameter list), info:0x0, cmd_spec:0x0, sns_spec:0x0, host:0x0, abort:0, CDB:1C010201B000 (Receive diagnostic), blk:0x0, blkcnt 0x0, fru_cd:0x0, LUN:0, LUN_WWN:0000000000000000 after 0.018s, toterr:41344, deverr:41123
Red herring? Or useful?
Perhaps not a red herring at all... after a forced restart, it croaked again on the same message...
[edit] Again, it choked on the same syslog entry (and continues to ingest the raw data incorrectly)
gist'd for clarity: https://gist.github.com/surlypants/3b441c1d04469036bdc9
[edit]
It seems that LS goes out to lunch as soon as it sees a _grokparsefailure.
To try to debug why LS is not interpreting the underlying syslog message, I added this clause at the top of the appropriate filter (because it's seemingly only this one host that has suddenly begun having this problem):
if [host] == "3PAR_1648417" {
  mutate {
    add_field => { "debug_message" => "%{message}" }
  }
}
With that in place, the captured field was:
debug_message => nd: getsralertcrit all Error: no criteria listed} {}
while the underlying syslog message is:
2015-06-02T23:59:07-03:00 3PAR_1648417 cli_cmd_err sw_cli {3paradm super all {{0 8}} -1 10.0.2.66 306} {Command: getsralertcrit all Error: no criteria listed} {}
I'm so confused...
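Worth noting: both failing messages begin mid-word ("ameter)," and "nd: getsralertcrit"), which suggests the file input handed the filter a partial line, and no grok pattern will match a fragment. For reference only, since the actual filter config isn't shown here, a pattern that would match the full lines looks roughly like this (field names are made up):

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:log_time} %{NOTSPACE:device} %{NOTSPACE:event_class} %{GREEDYDATA:event_body}" }
  }
}

Against the cli_cmd_err line above, this puts 3PAR_1648417 in device and cli_cmd_err in event_class; against the truncated "nd: getsralertcrit..." fragment it fails at the first token, producing exactly the _grokparsefailure tag seen in the ES document.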
Can you give me a sample log that will reproduce your issue? That would be great; I'll take a look later.
you've got mail.
I've checked your stack trace @Centopus and it looks exactly like my #3361 where there's a PR linked you could try. The issue is deadlock between single worker on mutex sync inside a WHEN node and pipeline thread on mutex sync in a queue push. I'm happy to look through others' stack traces to help find matches for #3361 - so far just yours in this issue I think.
@surlypants if you send a QUIT signal to logstash (kill -QUIT <pid-of-logstash>), Java will catch it and give a stack trace at the Java level. Unfortunately gstack is too low level, at the Runtime level, and all JRuby meaning is lost. jstack seems unreliable too (QUIT does pretty much the same, but to stdout).
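For reference, capturing the dump looks something like this (the pgrep pattern and output path are just examples):

$ PID=$(pgrep -f 'org.jruby.Main.*logstash')
$ kill -QUIT "$PID"                                # dump goes to the JVM's stdout/log; process keeps running
$ jstack -l "$PID" > /tmp/logstash-threads.txt     # JDK alternative that writes to a file, if it cooperates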
@driskell no luck:
$ ps ax | grep logstash | grep java
 7095 ?  SNl  24:28 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6
$ kill -QUIT 7095
$ ps ax | grep logstash | grep java
 7095 ?  SNl  24:29 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6
@surlypants The trace will appear in the log file logstash is outputting to. The QUIT signal won't kill the process - it will just print its stack trace to the log. If you can put it in a gist it will be easier to read.
Hello again. I've been trying to induce the hangup by different types of log rotation. No luck.
I'm currently out of ideas. I'm following the topic, just message me if I can help with anything else.
@centopus your stack trace looked exactly like #3361, which you'll likely find almost impossible to reproduce on demand. Hopefully the fix will be merged soon.
I'm able to reproduce this consistently on master after inserting some sleeps in particular places, and then running logstash:
logstash-1.5.0 % bin/logstash -e 'input { generator { count => 25 message => "1"} } filter { clone { clones => ["a"]} }'
pushing flush_event
Logstash startup completed
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing event
pushing flush_eventpushing flush_event
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
@jsvd Do you have a stack trace? Can you send QUIT signal and provide it in a gist? If it is #3361 there's a PR to fix it you could try: #3363
@driskell :facepalm: I wanted to post this in #3361, it is the scenario of flush -> input queue fills -> flusher thread triggers deadlock
I'm checking your PR now...
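For anyone trying to picture that failure mode, here is a minimal Ruby sketch of the pattern (illustrative names, not Logstash's actual classes): one thread holds a filter mutex while pushing into a full bounded queue, and the only consumer needs that same mutex before it can pop.

require "thread"

queue = SizedQueue.new(1)   # bounded queue between pipeline stages
filter_lock = Mutex.new     # mutex guarding a filter / WHEN node

queue.push(:event)          # queue is now full

# Flusher: takes the filter lock, then blocks pushing into the full queue.
flusher = Thread.new do
  filter_lock.synchronize { queue.push(:flush_event) }
end

sleep 0.2  # let the flusher grab the lock and block on the push

# Worker: the only consumer, but it blocks waiting for the same lock.
worker = Thread.new do
  filter_lock.synchronize { queue.pop }
end

# Flusher waits for queue space only the worker can create; the worker
# waits for the lock only the flusher can release. MRI aborts this script
# with "No live threads left. Deadlock?"; inside Logstash the pipeline
# simply hangs.
[flusher, worker].each(&:join)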
This is my logstash log from when logstash died:
{:timestamp=>"2015-06-05T10:50:02.123000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
{:timestamp=>"2015-06-05T10:50:02.132000+0000", :message=>"Lumberjack input, maximum connection exceeded, new connection are rejected.", :max_clients=>nil, :level=>:warn}
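For context, that warning comes from the lumberjack input's connection limiter. If the installed logstash-input-lumberjack version exposes the max_clients setting, raising it explicitly looks roughly like this (port, paths, and the limit are placeholders):

input {
  lumberjack {
    port => 5043                                     # placeholder
    ssl_certificate => "/etc/logstash/logstash.crt"  # placeholder
    ssl_key => "/etc/logstash/logstash.key"          # placeholder
    max_clients => 2000                              # assumption: supported by the installed plugin version
  }
}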
What about netstat -pnt | grep java when it's "stuck"? Only one person has provided that so far.
After much wailing and gnashing of teeth, I believe my problem was ultimately related to:
After inserting a boatload more "security-casing", my issue has subsided.
We have merged this PR (https://github.com/elastic/logstash/issues/3361) into the master branch; it should fix hangs related to the multiline filter or any other plugins that periodically flush their buffers. If you are experiencing this issue, try the PR. We should release a 1.5.1 version soon containing the fix.
I'm using 1.5.1 and have the same issue:
$ sudo service logstash stop
Killing logstash (pid 10409) with SIGTERM
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
Waiting logstash (pid 10409) to die...
logstash stop failed; still running.
Same issue here with 1.5.4. Logstash takes more than 100% CPU and crashes after some time.
I think we can close this issue; it's from LS 1.5.1 and a lot of the logic has changed in the related code. If it's still an issue with a newer LS version, please open a new issue.
Hi, today I've upgraded to logstash version 1:1.5.0-1 on Ubuntu 14.04, x86_64, and it seems to hang after some time without any notification in the logs whatsoever. When I try to stop the process, I get the following:

and then I have to do a kill -9 to kill the process, and it goes on and on. I started logstash manually with the debug option and after some time it died again, but I got the following line at the end:

The debug output stops at this point and the process hangs.
I have a lot of logstash "senders" which just use lumberjack to the main node to send encrypted logs; they are upgraded to 1.5 and work fine. It seems that the "master" logstash with the elasticsearch output has a problem somewhere which is causing it to hang. Here's the output config:
At the moment I've reverted back to logstash 1.4.2. Thanks and regards.