Callidryas opened 8 years ago
+1. We have changing load on our cluster of Logstash instances during the day, and the easiest way to autoscale is via ELB. Reconnects should be attempted, and DNS should be re-queried at the same time (either respect the DNS TTL or just don't cache).
+1 for forced TCP close / worker re-gen interval
+1 This will be great!
any updates on this one?
Hi, I am trying to solve this by creating a TTL config on the Logstash output plugin.
This TTL config will contain the number of seconds to keep the connection alive. If this number is greater than zero, we start a timer that is reset every 'n' seconds (where 'n' = the TTL in seconds).
Before sending events, we check whether our timer has expired. If it has, we close the client and connect again.
I did an initial commit here: https://github.com/maiconio/beats/commit/274970c5f3af9bc46d16021f8a995acae951e09e and in my experiments everything is working as expected (including the test suite).
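Roughly, this is how it would look from a user's point of view in the Filebeat config; the option name and the value here are just illustrative, nothing is final yet:

```yaml
output:
  logstash:
    hosts: ["logstash-lb.example.com:5044"]  # hypothetical load-balanced endpoint
    # Proposed option: maximum lifetime of a connection, in seconds.
    # Before each publish the client checks the timer; once it has expired,
    # the connection is closed and re-established (re-resolving DNS).
    ttl: 300
```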
I am willing to create a PR, but before doing that I would like some feedback and some guidance on how to improve this.
Thanks
But that doesn't solve the problem with filebeat, does it?
Hi @fiunchinho. I think that it does.
Filebeat is built on top of the libbeat library and its plugins. This way, I think the right place to implement a TTL config would be inside the libbeat Logstash output plugin.
I set up a test environment with one Filebeat instance writing to 3 Logstash instances running behind an Nginx load balancer. In my tests, the Filebeat instance stays connected to the same Logstash instance until the TTL limit is reached. After that the client gets disconnected and then connects again, this time to another Logstash instance (according to the load balancer's round-robin algorithm).
Sounds about right
@maiconio any update on this issue? :)
+1. We are autoscaling our Logstash hosts, but we have found that new Logstash hosts aren't receiving any traffic unless we restart Filebeats everywhere.
@joshuaspence
You can work around this issue by specifying the DNS name multiple times, e.g. for us the chain is Filebeat -> Logstash -> Elasticsearch.
Filebeat config would contain:

```yaml
hosts: ["internal.logstash.XXX.co.uk", "internal.logstash.XXX.co.uk", "internal.logstash.XXX.co.uk"]
```

Logstash contains:

```
elasticsearch {
  hosts => ["{{elasticsearch_host}}", "{{elasticsearch_host}}", "{{elasticsearch_host}}", "{{elasticsearch_host}}"]
  index => "%{component}-%{+YYYY-MM-dd}"
}
```
This allows both Filebeat and Logstash to renew their TCP connections. We can then autoscale Logstash and scale up Elasticsearch when needed without seeing any CPU issues on the services.
@mazurio Do you have loadbalance: true configured on the filebeat?
@maiconio Any news on that pull request? This seems like a well thought out feature and suits well with ELB and similar LBs.
Hi @Bizzelicious,
It's set to:
```
output:
  logstash:
    bulk_max_size: 100
    timeout: 15
    hosts: {{hosts}}
    port: 5044
    loadbalance: true
```
+1 for this feature, as sometimes there is a need to re-establish the connection based on a global DNS load balancing change.
To modify my previous statement,
One possible solution is to set a TCP connection timeout on the load balancer (which might actually be the better solution!).
Since we strive to support load balancing on our own within beats, I agree that that code should probably manage this sort of pain. In particular, if as part of the endpoint health checks, we could check if DNS is still valid, that would be stellar.
Would it be a fair assumption that if DNS changes, then we should abandon an old connection and re-establish? (As part of supporting load balancing behaviour).
So to summarize my opinion:
@PhaedrusTheGreek
Not all load balancers can kill an established connection. AWS ELB and ALB can only time out idle connections, for example.
@Bizzelicious noted. And technically RST'ing a connection is not as clean as client software deciding that a connection has been used for long enough, closing it cleanly, and re-establishing.
Since connections from filebeat to logstash are sticky, when an instance joins the load balancer, it does not get an equal distribution
Equal distribution of connections does not mean equal distribution of load. A single TCP connection costs about the same as 10,000 TCP connections. The data flowing over a connection is the load for Beats.
A load balancer is unlikely to effectively balance load for beats because the "load" (the thing which consumes cpu/network resource) in beats is an event, and there is no general correlation between load and connection count. 10,000 beats connections could be idle and not sending any data; 1 beat could be sending 100,000 events per second.
Beats supports a loadbalance mode that will try to distribute events across multiple connections to multiple Logstash endpoints. When new connections are needed, Beats will do a dns lookup (every time) and new servers appearing in the DNS entry will be considered for connection.
Beats output to Logstash, as described above, was designed to not need a load balancer at all in order to rendezvous with multiple Logstash servers. It's OK if you want to use a load balancer for the operational ability to add/remove backends, though (instead of using DNS for this purpose, for example).
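For illustration, the no-load-balancer setup I am describing looks roughly like this (hostnames are placeholders):

```yaml
output:
  logstash:
    # List the Logstash endpoints directly; beats balances events across them
    # and performs a fresh DNS lookup whenever a new connection is made.
    hosts: ["logstash-1.example.com:5044", "logstash-2.example.com:5044", "logstash-3.example.com:5044"]
    loadbalance: true
```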
Some questions. If you add a new server to your loadbalancer backend:
TTL on a connection is a solution that focuses only on the idea that the connection itself is the load, and as said, it is not.
Some talking points I like to mention with respect to load balancing and data streams:
The way I see the proposal, it's hoping that a TTL to destroy healthy connections will make Beats balance load. Destroying healthy connections will result in Beats retransmitting, which causes duplicates to end up in Logstash and Elasticsearch.
Stepping back, I see the problem basically this: Users have a load balancer, they want to add and remove backends, and they want beats to make use of new servers as they are added to the load balancer.
I propose altering the solution proposed:
Standard Load Balancing && (TCP Connection Count != Load)
Although TCP connections don't constitute load, distributing TCP connections will generally give you distributed load the majority of the time. This may be a moot point, however, because it's agreed that Beats can be a more precise load balancer.
P.S. Good load balancers will determine the true load, not just the TCP connection count, in order to determine where to route the incoming TCP connection - but either way, a new TCP connection is required.
Global Load Balancing (DNS Load Balancing)
Additionally, there is a case where Beats is sending data to Logstash behind 2 load balancers that are active/inactive based on DNS via a global load balancer. In this case, distribution of load is not the consideration, but awareness of the site's "active/failed" status. In this case, should beats have to be restarted? If not, then should connections with invalid DNS be closed somehow?
@PhaedrusTheGreek Well, a failure (timeout, network interrupt, etc.) will cause beats to need a new connection, which does a DNS lookup. If you're saying a "failure" is a situation which is undetectable, then there would be no way for beats to determine that a new connection is needed. It feels like the assumption here is that the "old" entries in the DNS or load balancer are still healthy and working? If so, why would a healthy entry be removed from your DNS/balancer pool?
I still don't think I understand the problem. Here's my pitch:
- the beats protocol requires acknowledgements from Logstash that it has accepted a set of events.
- if no acknowledgement occurs within a given time period, beats assumes Logstash to be dead or stuck, and it will pick a new connection to retransmit and use.
With the above, load will be distributed automatically in such a way that no single Logstash is overloaded. If one Logstash becomes overloaded, many beats will detect this (through acknowledgement timeouts) and will try alternative Logstash servers.
Fair distribution of load (N servers, each server receiving exactly 1/N of the data) sounds nice, but I'm not sure how this helps more than the current model? The current model is this: If you are capable of processing M event rate per server, and you have N servers, you can do a maximum of N*M event rate. If several beats end up trying to send more than M rate to a single server, then some of those beats will timeout and reconnect to another server. This ensures the fleet of beats will cooperate by detecting when a server is overloaded (timeouts) and distributing to other servers.
distributing TCP connections will generally give you distributed load the majority of the time
This is not true. It may be true for some users, but it is not true in general.
Good load balancers will determine the true load
With this definition, I have never experienced a good load balancer. Certainly none are likely to exist that are aware of what constitutes load for the beats network protocol. Beats itself is capable of doing this data distribution without the need for any middleware.
Maybe we can approach this from another angle:
Currently the only answer to any of the above seems to be - restart beats!
Do we support DNS load balancing? If so, what constitutes support?
I will run through the code and diagram what happens. This is a good question to have a nice answer for :)
what do we recommend for a re-distribution workaround?
I personally do not recommend load balancers for the purposes of distributing connections because beats (and logstash-forwarder before it) already does this.
If it's helpful operationally to use a load balancer to allow adding/removing of backends, then I think this is useful. For distributing load, however, I do not think any general load balancer is going to be effective, because my experience is that load balancers assume one connection is one unit of load, which, as said, is not correct for beats.
Dropping healthy connections when DNS is invalid?
Why would a user want a healthy connection to be destroyed?
I still don't think I understand the problem. Here's my pitch:
- the beats protocol requires acknowledgements from Logstash that it has accepted a set of events.
- if no acknowledgement occurs within a given time period, beats assumes Logstash to be dead or stuck, and it will pick a new connection to retransmit and use.
With the above, load will be distributed automatically in such a way that no single Logstash is overloaded. If one Logstash becomes overloaded, many beats will detect this (through acknowledgement timeouts) and will try alternative Logstash servers.
I think we're missing some system-end-to-end view in the protocol itself regarding active 'transactions' between beats and logstash.
There is a difference between timeout and timeout, so to say. Here is the catch: the partial ACK acts like a keep-alive. That is, Logstash will send ACK 0 every 5 seconds while a batch is in progress. This kind of keep-alive is used by Logstash to signal that it is still actively processing the batch. It acts as a back-pressure signal. The keep-alive is used to fight the old LSF issue of resending a big batch of events over and over again due to continuous timeouts caused by a) Logstash dealing with back-pressure, b) crazy grok patterns on bad input (slow filters), c) too big a batch to be processable in time (here we've got some slow-start windowing support, but it's more of a hack). If no keep-alive is received from Logstash in time, the connection is assumed to be broken and beats reconnects and resends. This is clearly a change in behavior between LSF and beats.
Pro:
Beats simply don't know if Logstash is stuck or is waiting for outputs (back-pressure). Beats can not cancel transactions (remove events from LS queues) either. On the other hand, LS doesn't know whether beats uses load balancing or not. That is, it can not just suppress the ACK 0 signal. The current behaviour is more of a trade-off between beats killing downstream systems and potential error detection, based on the very localized view beats and Logstash have of each other.
Given our preferred use case is hundreds of beats sending to a low number of LS servers, in the load-balancing scenario I can see two modes (currently supported by beats/LS):
1) Every beat connects to one LS instance only (random if configured in beats, round-robin with a DNS-based load balancer). The disadvantage is that connections and load might not be well distributed. Plus, with LS sending keep-alive signals, connections won't balance out over time. For LSF, connections might balance out due to the timeout being effective, but the main problem is this: LSF/beats can not really tell the difference in downstream state and act accordingly (wait or reconnect?).
2) Every beat connects to all known LS instances. The load balancer in beats uses a shared work queue. Slow endpoints will receive fewer events than faster endpoints. This way the overall system throughput has a chance to balance out. Only if all LS instances are stuck will the full system grind to a halt. This is basically how Kafka operates: just have all producers send data to all brokers.
Having beats send to all known LS instances also adds the advantage of beats/LS not having to distinguish between sources of LS slow-downs. We just don't care. A slower/blocked/unavailable LS instance will just not receive any more data.
Not yet possible, but we might still have to update beats to 'cancel' a batch in order to forward it to another endpoint if some timeout hits. The connection will be closed and beats reconnects to the unavailable/slow node after some timeout. This way, a stuck node can not fully block e.g. Filebeat, which requires an ACK before it can continue. The main difference is, this full-ack-timeout is only used if:
The problem is with users wanting to add/remove LS instances on the fly. It's kind of nightmarish to handle/configure right now. Some bootstrapping + regular connection updates by beats would be great. Potential solutions (I'd make them pluggable/configurable within beats) could be:
Note: From the AWS ELB docs it seems ELB acts more like a TCP proxy than being DNS based (with the option of internet-facing nodes). I'm not sure right now if one can query all available IPs behind the load balancer in all cases.
Dropping healthy connections when DNS is invalid? Why would a user want a healthy connection to be destroyed?
Good question. Reasons I can think of:
Consider a new feature for the "load balancing" mode in beats such that a user can say "Connect to this endpoint N times", in the hopes that connecting N times to the same load balancer will result in multiple connections to different backends (although, with a load balancer, such guarantees of different backend connections are not necessarily possible).
This is already supported by beats, but given M beats connecting to N LS instances, plus simple round-robin load balancing, this most likely will not have the desired effect if multiple beats are started at the same time.
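E.g. something along these lines (the endpoint name is a placeholder); with many beats starting at the same time, the balancer's round-robin may still place several of those connections on the same backend:

```yaml
output:
  logstash:
    # The same load-balanced endpoint listed three times opens three connections
    # to it; whether they land on different backends is up to the balancer.
    hosts: ["logstash-lb.example.com:5044", "logstash-lb.example.com:5044", "logstash-lb.example.com:5044"]
    loadbalance: true
```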
Fair distribution of load (N servers, each server receiving exactly 1/N of the data) sounds nice, but I'm not sure how this helps more than the current model? The current model is this: If you are capable of processing M event rate per server, and you have N servers, you can do a maximum of N*M event rate. If several beats end up trying to send more than M rate to a single server, then some of those beats will timeout and reconnect to another server. This ensures the fleet of beats will cooperate by detecting when a server is overloaded (timeouts) and distributing to other servers.
As already mentioned, the timeout+reconnect case is not applicable, due to the keep-alive signal from LS. Load balancing is not just about throughput; it also simplifies error handling from beats' point of view, as beats can still progress if one node becomes unavailable or is stuck. As we can't figure out the root cause of why a Logstash instance is stuck, we can hardly make an informed decision about forcing a reconnect to another instance or not. Plus, we don't know whether this would relieve the situation. In the latter case, reconnects could amplify the overload situation.
About idle connections:
Callidryas: "A great enhancement would be to provide a time to live (TTL) config value to specify how long a TCP connection from Filebeat to Logstash should live."
This is a great idea!
We had a similar "issue" in our web service application (which uses "Connection: keep-alive") with GCP load balancers. We implemented a CXF interceptor that every x requests (configurable) sends the request with the HTTP header "Connection: close".
This works great; the load is now distributed evenly.
So I'd assume the idea proposed by @Callidryas will work.
I tried to get it working with this TTL field, but it simply is not working. See #7824
For anyone finding this issue: it's been fixed with the ttl setting - https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html#_ttl
Can this be closed now?
The issue is still open, because there is a limitation on the `ttl` setting requiring users to also set `pipelining: 0`. Without explicitly setting `pipelining`, the `ttl` setting will have no effect.
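For anyone landing here, a minimal example of the combination being described (the hostname is a placeholder; check the docs for your version before relying on it):

```yaml
output.logstash:
  hosts: ["logstash-lb.example.com:5044"]
  ttl: 60s
  # ttl only takes effect on the synchronous client, so pipelining
  # has to be disabled explicitly:
  pipelining: 0
```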
Is there any update? We are running logstash 7.3 in AWS behind ELB (5 instances) and only 2 instances are taking traffic. This is really causing an issue for us (log delay)
Pinging @elastic/integrations-services (Team:Services)
This ticket is approaching 5 years old. My company is in pain with the problem of beats going silent when we use a single DNS name in output.logstash.hosts pointing to an AWS load balancer that sits in front of a fleet of container-based Logstash nodes. We have ttl: 120s and pipelining: 0 and it is insufficient (beats still go silent).
This post mentions disabling the DNS cache. It is interesting, and I have heard nothing from Elastic or this ticket about that, so I am sharing it. I am not yet sure whether this is a viable approach for us.
Thoughts?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
+1 on this. We are also experiencing the same problem where Beats never close their persistent sessions, hence getting stuck with the Logstash server they originally connected to, unless we force kill their TCP sessions on Logstash.
We have configured the Beat clients to load balance between two different FQDNs, each belonging to a different data center (e.g. US-East and US-West). Each data center represents 30+ Logstash servers behind a load balancer.
```yaml
output.logstash:
  hosts: ["us-east-1.company.com", "us-west-1.company.com"]
  loadbalance: true
  ttl: 60s
```
We are seeing connections switching between the data centers with a ~50-50% load, but they always end up glued to the same Logstash server in each region. We want to be able to periodically rebalance system load on the Logstash servers by having the Beats re-establish their sessions, so the load balancer at each data center gets a chance to route them to the least busy Logstash server. What is happening instead is that the Beats seem to keep both sessions alive at all times while load balancing between the two. We also added the TTL setting above, to no avail.
For load balancing, we have tried a) script to remove busy Logstash servers from DNS response based on average system load and JVM heap; or b) using a Keepalived load balancer in front of Logstash with Least Connection LB algorithm. In both cases, we end up with sessions distributed unevenly across our Logstash servers, causing logs to queue up on some servers while others are sitting idle.
Our plan right now is to force-kill TCP sessions on busy Logstash servers so the load balancers can do their job when the Beats try to re-establish their sessions, but that's not a very elegant solution.
Is it possible to add a setting to have the Beats periodically try to reestablish their sessions? Or perhaps adding an option in Logstash Beats input to gracefully close active sessions after a certain period of time?
@kkojouri You need to set `pipelining: 0`. See urso's comment here: https://github.com/elastic/beats/issues/661#issuecomment-553433704
Docs:
The "ttl" option is not yet supported on an async Logstash client (one with the "pipelining" option set).
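In other words, adapting the config you posted would look roughly like this (untested, so treat it as a sketch):

```yaml
output.logstash:
  hosts: ["us-east-1.company.com", "us-west-1.company.com"]
  loadbalance: true
  ttl: 60s
  # Without this, the async (pipelined) client is used and ttl is ignored:
  pipelining: 0
```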
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as `Stale` to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
:+1:
A great enhancement would be to provide a time to live (TTL) config value to specify how long a TCP connection from Filebeat to Logstash should live. After that TTL, a new connection would be created. This would allow better distribution to Logstash instances behind a load balancer.
Since connections from filebeat to logstash are sticky, when an instance joins the load balancer, it does not get an equal distribution. For example, if there are 4 instances behind a load balancer and 3 of them are rebooted, then all filebeat connections will go to the single instance that was not rebooted. I am not aware of any other events, aside from failure, that would cause the connection to be reestablished to a potentially different server.
By specifying a TTL on the connection, there is an opportunity for the load balancer to distribute connections equally between the instances. Using a load balancer tends to be more convenient than updating the configuration of all filebeat clients in an environment.