elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Logstash stuck when the output is blocked and persistent queue is full #14740

Open ferdose7 opened 2 years ago

ferdose7 commented 2 years ago

We are using Logstash 7.15.2 version. The Logstash pipeline is configured as follows.

pipelines.yml:

Logstash is configured with syslog/lumberjack and elasticsearch output pipelines.

[2022-10-28T13:31:53.053Z][INFO ][logstash.agent ] Pipelines running {:count=>3, :running_pipelines=>[:syslog, :elasticsearch, :logstash], :non_running_pipelines=>[]}
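The actual pipelines.yml contents are only in the attachment; purely as an illustration, a multi-pipeline setup with persistent queues enabled typically looks something like the sketch below (pipeline ids match the log line above, but the paths and sizes are hypothetical, not taken from the attachment):

```yaml
# Hypothetical sketch only — not the reporter's actual pipelines.yml.
- pipeline.id: syslog
  path.config: "/etc/logstash/conf.d/syslog.conf"
  queue.type: persisted        # enable the persistent queue
  queue.max_bytes: 1gb         # queue counts as "full" at this limit
- pipeline.id: elasticsearch
  path.config: "/etc/logstash/conf.d/elasticsearch.conf"
  queue.type: persisted
  queue.max_bytes: 1gb
```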

The following steps may trigger the issue:

  1. Deploy Logstash with syslog/lumberjack server and elasticsearch output pipelines.

  2. Bring the syslog/lumberjack server down. Logstash keeps trying to connect to the server, but the connection fails because the server is down.

  3. Logstash keeps trying to connect to the syslog/lumberjack server. [2022-11-04T15:39:15.015Z][WARN ][logstash.outputs.syslog ] syslog ssl-tcp output exception: closing, reconnecting and resending event {:host=>"host.com", :port=>8080, :exception=>#<SocketError: initialize: name or service not known>, :backtrace=>["org/jruby/ext/socket/RubyTCPSocket.java:141:in `initialize'", "org/jruby/RubyIO.java:876:in `new'", "/opt/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5.E001/lib/logstash/outputs/syslog.rb:219:in `connect'", "/opt/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5.E001/lib/logstash/outputs/syslog.rb:187:in `publish'", "/opt/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-plain-3.1.0/lib/logstash/codecs/plain.rb:59:in `encode'", "/opt/logstash/logstash-core/lib/logstash/codecs/delegator.rb:48:in `block in encode'", "org/logstash/instrument/metrics/AbstractSimpleMetricExt.java:65:in `time'", "org/logstash/instrument/metrics/AbstractNamespacedMetricExt.java:64:in `time'", "/opt/logstash/logstash-core/lib/logstash/codecs/delegator.rb:47:in `encode'", "/opt/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-syslog-3.0.5.E001/lib/logstash/outputs/syslog.rb:147:in `receive'", "/opt/logstash/logstash-core/lib/logstash/outputs/base.rb:105:in `block in multi_receive'", "org/jruby/RubyArray.java:1820:in `each'", "/opt/logstash/logstash-core/lib/logstash/outputs/base.rb:105:in `multi_receive'", "org/logstash/config/ir/compiler/OutputStrategyExt.java:143:in `multi_receive'", "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:121:in `multi_receive'", "/opt/logstash/logstash-core/lib/logstash/java_pipeline.rb:295:in `block in start_workers'"], :event=>#<LogStash::Event:0x7407a15a>}

  4. Since the syslog/lumberjack output is blocked, log events keep being added to the persistent queue until the syslog persistent queue becomes full.

  5. After some time Logstash becomes stuck, and we no longer see Logstash trying to connect to the syslog/lumberjack server.

  6. Bring the syslog/lumberjack server back up.

  7. Even after the syslog/lumberjack server is back up and running, Logstash does not try to reconnect to it and remains stuck.

  8. The issue is intermittent. eric-log-transformer-795d7cf554-c8jmw.txt
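The steps above describe classic back-pressure: once the persistent queue reaches its size limit, the input side blocks until the output drains the queue. A minimal Python sketch of that mechanism (a bounded in-memory queue standing in for the persistent queue; this models the expected behavior, not Logstash's actual implementation):

```python
import queue

# A bounded queue stands in for Logstash's persistent queue
# (queue.max_bytes); the output/consumer is "down", so nothing drains it.
pq = queue.Queue(maxsize=5)

def produce(events):
    """Feed events into the queue; stop once it is full (back-pressure)."""
    accepted = 0
    for event in events:
        try:
            pq.put_nowait(event)
        except queue.Full:
            break  # queue full: the input side is now blocked
        accepted += 1
    return accepted

# With no consumer draining the queue, only maxsize events are accepted.
print(produce(range(10)))  # -> 5
```

The reported bug is that even after the consumer side comes back (step 6), draining never resumes.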

Please find the thread dump and logs attached. Could you please look into the issue? Let us know if any further details are required.

yaauie commented 2 years ago
ferdose7 commented 2 years ago

Please refer to the comments inline.

> Can you define what you mean by "syslog/lumberjack server"? Is it a single host with both Syslog server that you are sending data to with a Syslog output and a Logstash server that you are sending data to with the Lumberjack Output?

syslog and lumberjack are different output plugins used to send log events to an external server. We have seen the issue in two different clusters: one cluster uses the lumberjack and elasticsearch outputs, and the other uses the syslog and elasticsearch outputs. Example lumberjack configuration from the first cluster:

```
output {
  lumberjack {
    id => "lumberjack1"
    hosts => ["host.com"]
    codec => json
    port => 8888
    ssl_certificate => "/run/secrets/lumberjackOutput-certs/tls.crt"
  }
}
```

Syslog configuration from second cluster:

```
output {
  syslog {
    host => "host.com"
    port => 8080
    protocol => "ssl-tcp"
    rfc => rfc5424
    use_labels => false
    appname => "%{appname}"
    priority => "%{priority}"
    message => "%{message}"
    sourcehost => "ccrc"
    procid => "%{[metadata][proc_id]}"
    msgid => "%{[metadata][category]}"
    ssl_cert => "/run/secrets/syslogOutput-certs/tls.crt"
    ssl_key => "/run/secrets/syslogOutput-certs/tls.key"
    ssl_cacert => ["/run/secrets/syslogOutput-cacerts/trustedcert"]
    ssl_verify => true
  }
}
```

For the first cluster, the issue was seen when the lumberjack server was down; for the second cluster, when the syslog server was down.
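When diagnosing this kind of hang, it can help to rule out basic reachability of the downstream server independently of Logstash. A small sketch (the host and port are the placeholders from the configs above, not real endpoints):

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. the syslog output's target from the second cluster's config:
# tcp_reachable("host.com", 8080)
```

Note this only checks TCP; both outputs here use TLS (ssl-tcp / ssl_certificate), so a successful TCP connect does not guarantee the TLS handshake will succeed.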

> Can you share the general shape of those pipelines (what inputs, rough quantity and kinds of filters, what quantity and kind of outputs are in each)?

Please refer to the attached sample configuration used in Logstash.

> Can you tell me whether the pipelines are related to each other (for example, using pipeline-to-pipeline, or any other way of sending the events from one pipeline to another), and if so, how?

We are not using pipeline-to-pipeline. Please refer to the attached sample configuration used in Logstash.
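A useful follow-up diagnostic here is the node stats API (GET /_node/stats/pipelines, on port 9600 by default), which reports each pipeline's persistent-queue size. A sketch that flags pipelines whose queue is near its limit, given a parsed stats response (the sample payload below is illustrative, not from the attached logs):

```python
def full_queues(stats, threshold=0.9):
    """Return pipeline ids whose persistent queue is >= threshold full,
    based on a parsed /_node/stats/pipelines response."""
    flagged = []
    for name, pipeline in stats.get("pipelines", {}).items():
        q = pipeline.get("queue", {})
        if q.get("type") != "persisted":
            continue  # in-memory queues have no byte capacity to check
        cap = q.get("capacity", {})
        max_bytes = cap.get("max_queue_size_in_bytes", 0)
        used = cap.get("queue_size_in_bytes", 0)
        if max_bytes and used / max_bytes >= threshold:
            flagged.append(name)
    return flagged

# Illustrative payload shaped like the node stats API response:
sample = {
    "pipelines": {
        "syslog": {"queue": {"type": "persisted",
                             "capacity": {"queue_size_in_bytes": 1020,
                                          "max_queue_size_in_bytes": 1024}}},
        "elasticsearch": {"queue": {"type": "persisted",
                                    "capacity": {"queue_size_in_bytes": 10,
                                                 "max_queue_size_in_bytes": 1024}}},
    }
}
print(full_queues(sample))  # -> ['syslog']
```

If the stuck pipeline's queue shows at or near max_queue_size_in_bytes while event counters stop advancing, that corroborates the full-PQ hang described in the report.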

logstash-config.txt.txt