Filebeat stays disconnected after logstash performance problems (even long after those are resolved)

Mekk commented 4 years ago

Relates https://github.com/elastic/beats/issues/40544
Relates https://github.com/elastic/beats/issues/661

I am not sure whether this is sufficient for detailed analysis, but I decided to report the problem so sb could take a look at filebeat behaviour having this context in mind.

Somewhat strange situation, but I observed it on many machines (filebeat 7.5.2, logstash also 7.5.2):

logstash (to which filebeat's connect) faced big performance problems (it was due to inefficient XML parsing, https://github.com/elastic/logstash/issues/11599 , but this is only context of this issue, if I were to test this behaviour on purpose I'd probably write some busy loop in ruby script triggered from a filter)
sooner or later filebeat started to report timeouts (properly, logstash didn't manage to handle communication fast enought)
… but for some reason filebeat remained in this state forever. Even long after logstash was restarted and the problem it faced resolved, running filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was resolved).

Restarting filebeat helped, but there is sth wrong in the fact that it didn't manage to recover itself (after all, in normal logstash restart/temporary inaccessibility case I never faced similar problems)

Logs picture. Here the problem started:

2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:35.025+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-12T19:44:36.909+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-12T19:44:36.909+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-12T19:44:40.506+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:40.506+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 1 reconnect attempt(s)
2020-02-12T19:44:47.796+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:47.796+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 2 reconnect attempt(s)
…
2020-02-12T19:46:18.110+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 5 reconnect attempt(s)
2020-02-12T19:46:18.111+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-12T19:46:50.516+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55834->10.92.23.57:5044: i/o timeout
2020-02-12T19:46:50.556+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected

This is all OK, logstash was massacring CPU and likely was unable to keep up with connections.

But then, the problem was resolved early next day, logstash was restarted, those filebeats which were restarted work since then happily. But the filebeat instance which remained unrestarted still doesn't push logs and logs things like (random snippet from log taken more than 24 hours since the problem was resolved):

2020-02-14T16:30:47.871+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:30:47.872+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:17.920+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:33111->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:17.967+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-14T16:31:19.305+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-14T16:31:19.306+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:19.307+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:49.350+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:33112->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:49.391+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-14T16:31:50.807+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-14T16:31:50.807+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:50.808+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:20.854+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:33113->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:20.896+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-14T16:32:22.755+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-14T16:32:22.755+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:22.756+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:52.804+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:33114->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:52.844+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-14T16:32:54.005+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-14T16:32:54.005+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:54.007+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:33:24.052+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:33115->10.92.23.57:5044: i/o timeout
2020-02-14T16:33:24.095+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-14T16:33:25.708+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-14T16:33:25.708+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:33:25.710+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established

and so on, and so on, and so on, until restart (after which everything started to work OK).

This is the extreme case, but in general any filebeat instance which started to report errors like above had to be restarted.

To my naive eye it looks as if something desynchronized here, as if new established connections were closed due to old error notes or due to backpool of old errors or sth like that.

The error context may be significant because of specific behaviour – upstream connections were not closed by remote side, but simply were not handled (maybe some buffers filled etc).

PS It may matter, or not, that I use a few sections and there are plenty of log files

jsoriano commented 4 years ago

Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in https://github.com/elastic/beats/issues/16864.

elasticmachine commented 4 years ago

Pinging @elastic/integrations-services (Team:Services)

jsoriano commented 4 years ago

This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207

urso commented 4 years ago

While events are actively processed by Logstash, Logstash sends an 'heartbeat' signal to Filebeat every 10s I think. Filebeat times out the connection if no signal was received for the last 30s (see setting: output.logstash.timeout).

After the timeout we're even seeing: Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused. Filebeat can't reconnect. Either there are network problems, or there is a connection limit reached in Logstash. Maybe the the workers (not sure, but I think Logstash uses a separate worker pool for beats communication) didn't get CPU time by the OS.

M0rdecay commented 4 years ago

Same problem with Filebeat and Logstash 7.7.1. This is a serious trouble if there is periodic work on the network that makes Logstash unavailable for a while.

danlsgiga commented 3 years ago

Just hit this very same problem here... this is a silent failure that caused us to miss days of logs because filebeat either failed hard and forced the initd/systemd subsystem to restart it, nor logged or retried the connection.

We do have the output.logstash.timeout setting but it seems to be of no effect in this specific scenario.

filebeat v7.5.2

seadub commented 3 years ago

bump, impacts filebeat 6.8 as well.

whyayala commented 3 years ago

Also seeing this in filebeat 7.15 deployed as a Task in AWS ECS with an image from ECR public gallery.

https://gallery.ecr.aws/elastic/filebeat

ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected

barionoperators commented 2 years ago

Hi, We have the same problem with filebeat and winlogbeat 7.16.3. Tried every combination of firewall rules ( port 5044 to and from the logstash server, rule with all ports open). Creating these FW rule seemed to resolve the problem so we started to narrow down the FW rules. Now we have a situation where some clients (VLAN1) works perfectly without any additional FW rule (only 5044 allowed form the client to the server), and we have clients (VLAN2) which won't work after a logstash restart although all ports are open. It is a very blocking issue for us. Can someone help?

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Carmelly212 commented 2 years ago

Hi, same issue here ERROR logstash/async.go:256 Failed to publish events caused by: write tcp 10.2.... trying to run filebeat as a sidecar in a helm chart... I get the error when ever there is a JSON parse error but other then that logs are not showing up in elastic. can't tell if they reach logstash

sebglon commented 1 year ago

hi, We have the same issue; i describe it here: https://discuss.elastic.co/t/filebeat-connection-error-with-logstash/327208

botelastic[bot] commented 8 months ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

henrikno commented 4 months ago

👍

Mekk commented 4 months ago

Reminder of the test scenario:

provoke situation where logstash is cpu-starved (in my case it was doing sth very cpu-intensive during log processing, in real case it was xml parser mishap, but anything would likely do – just write ruby filter which is doing some calculations for 30 seconds for ⅓ entries)
wait until filebeat reports connection problems (while logstash is working and doing that slow cpu-intensive processing)
„fix” (remove cpu-intensive filter) and restart logstash
observe whether filebeat manages to connect back

elastic / beats

Filebeat stays disconnected after logstash performance problems (even long after those are resolved) #16335