Mekk opened this issue 4 years ago
Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in https://github.com/elastic/beats/issues/16864.
Pinging @elastic/integrations-services (Team:Services)
This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207
While events are actively being processed by Logstash, Logstash sends a 'heartbeat' signal to Filebeat every 10s, I think. Filebeat times out the connection if no signal was received for the last 30s (see the output.logstash.timeout setting).
After the timeout we're even seeing: Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused. Filebeat can't reconnect. Either there are network problems, or a connection limit is reached in Logstash. Maybe the workers (not sure, but I think Logstash uses a separate worker pool for beats communication) didn't get CPU time from the OS.
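For reference, a minimal sketch of the filebeat.yml section in question – the host is taken from the error message above, and the values shown are what I believe the defaults are, not a recommended tuning:

```yaml
output.logstash:
  hosts: ["logstash.test.local:5044"]
  timeout: 30        # drop the connection if Logstash does not respond within 30s
  backoff.init: 1    # wait 1s before the first reconnect attempt ...
  backoff.max: 60    # ... backing off up to 60s between further attempts
```

In the failure described in this thread, filebeat appears to keep cycling through these reconnect attempts without ever re-establishing a working connection.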
Same problem with Filebeat and Logstash 7.7.1. This is serious trouble if there is periodic work on the network that makes Logstash unavailable for a while.
Just hit this very same problem here... this is a silent failure that caused us to miss days of logs, because filebeat neither failed hard (which would have forced the init/systemd subsystem to restart it), nor logged or retried the connection.
We do have the output.logstash.timeout setting, but it seems to have no effect in this specific scenario.
filebeat v7.5.2
bump, impacts filebeat 6.8 as well.
Also seeing this in filebeat 7.15 deployed as a Task in AWS ECS with an image from ECR public gallery.
https://gallery.ecr.aws/elastic/filebeat
ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected
Hi, we have the same problem with filebeat and winlogbeat 7.16.3. We tried every combination of firewall rules (port 5044 to and from the logstash server, a rule with all ports open). Creating these FW rules seemed to resolve the problem, so we started to narrow the rules down. Now we have a situation where some clients (VLAN1) work perfectly without any additional FW rule (only 5044 allowed from the client to the server), and we have clients (VLAN2) which won't work after a logstash restart even though all ports are open. This is a blocking issue for us. Can someone help?
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Hi, same issue here ERROR logstash/async.go:256 Failed to publish events caused by: write tcp 10.2....
trying to run filebeat as a sidecar in a helm chart...
I get the error whenever there is a JSON parse error, but other than that the logs are not showing up in Elastic; I can't tell if they reach Logstash.
Hi, we have the same issue; I describe it here: https://discuss.elastic.co/t/filebeat-connection-error-with-logstash/327208
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
👍
Reminder of the test scenario:
- provoke a situation where logstash is CPU-starved (in my case it was doing something very CPU-intensive during log processing – in the real case it was an XML parser mishap, but anything would likely do; just write a ruby filter which does some calculations for 30 seconds on roughly ⅓ of the entries, as sketched below)
- wait until filebeat reports connection problems (while logstash is still working on that slow, CPU-intensive processing)
- "fix" the problem (remove the CPU-intensive filter) and restart logstash
- observe whether filebeat manages to connect back
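A minimal sketch of what such a CPU-burning pipeline could look like – the ruby filter body, the 1-in-3 sampling, and the stdout output are just an illustration of the idea, not the original configuration:

```
input {
  beats {
    port => 5044
  }
}

filter {
  ruby {
    # Burn CPU for ~30 seconds on roughly every third event to starve Logstash.
    code => '
      if rand(3) == 0
        deadline = Time.now + 30
        x = 0
        x += 1 while Time.now < deadline
      end
    '
  }
}

output {
  stdout { codec => dots }
}
```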
I am not sure whether this is sufficient for detailed analysis, but I decided to report the problem so somebody could take a look at filebeat's behaviour with this context in mind.
Somewhat strange situation, but I observed it on many machines (filebeat 7.5.2, logstash also 7.5.2):
- logstash (to which the filebeats connect) faced big performance problems (it was due to inefficient XML parsing, https://github.com/elastic/logstash/issues/11599, but that is only the context of this issue; if I were to test this behaviour on purpose I'd probably write some busy loop in a ruby script triggered from a filter)
- sooner or later filebeat started to report timeouts (rightly so, logstash didn't manage to handle communication fast enough)
- … but for some reason filebeat remained in this state forever. Even long after logstash was restarted and the problem it faced was resolved, the running filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was resolved).
- Restarting filebeat helped, but there is something wrong in the fact that it didn't manage to recover by itself (after all, in a normal logstash restart / temporary-inaccessibility case I never faced similar problems).
A picture of the logs. Here is where the problem started:
This is all OK – logstash was hammering the CPU and likely was unable to keep up with connections.
But then the problem was resolved early the next day, logstash was restarted, and those filebeats which had been restarted have worked happily since then. The filebeat instance which remained unrestarted, however, still doesn't push logs and keeps logging things like this (a random snippet from the log, taken more than 24 hours after the problem was resolved):
and so on, and so on, and so on, until restart (after which everything started to work OK).
This is the extreme case, but in general any filebeat instance which started to report errors like the above had to be restarted.
To my naive eye it looks as if something got desynchronized here – as if newly established connections were closed because of old error notes, or a backlog of old errors, or something like that.
The error context may be significant because of the specific behaviour – the upstream connections were not closed by the remote side, they simply were not handled (maybe some buffers filled up, etc.).
PS. It may or may not matter that I use a few sections and that there are plenty of log files.