Callidryas opened 8 years ago
+1. We have changing load on our cluster of Logstash instances during the day, and the easiest way to autoscale is via ELB. Reconnects should be attempted, and DNS should be re-queried at the same time (either respect the DNS TTL or just don't cache).
+1 for forced TCP close / worker re-gen interval
+1 This will be great!
any updates on this one?
Hi, I am trying to solve this by creating a TTL config on the Logstash output plugin.
This TTL config will contain the number of seconds to keep the connection alive. If this number is greater than zero, we start a timer that is reset every 'n' seconds (where 'n' = the TTL in seconds).
Before sending events, we check whether our timer has expired. If it has, we close the client and connect again.
I did an initial commit here: https://github.com/maiconio/beats/commit/274970c5f3af9bc46d16021f8a995acae951e09e and in my experiments everything is working as expected (including the test suite).
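Roughly, this is how it would look from a user's point of view in the Filebeat config; the option name and the value here are just illustrative, nothing is final yet:

```yaml
output:
  logstash:
    hosts: ["logstash-lb.example.com:5044"]  # hypothetical load-balanced endpoint
    # Proposed option: maximum lifetime of a connection, in seconds.
    # Before each publish the client checks the timer; once it has expired,
    # the connection is closed and re-established (re-resolving DNS).
    ttl: 300
```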
I am willing to create a PR, but before doing that I would like some feedback and some guidance on how to improve this.
Thanks
But that doesn't solve the problem with filebeat, does it?
Hi @fiunchinho. I think that it does.
Filebeat is built on top of the libbeat library and its plugins. This way, I think the right place to implement a TTL config would be inside the libbeat Logstash output plugin.
I set up a test environment with one Filebeat instance writing to 3 Logstash instances running behind an Nginx load balancer. In my tests, the Filebeat instance stays connected to the same Logstash instance until the TTL limit is reached. After that the client gets disconnected and then connects again, this time to another Logstash instance (according to the load balancer's round-robin algorithm).
Sounds about right
@maiconio any update on this issue? :)
+1. We are autoscaling our Logstash hosts, but we have found that new Logstash hosts aren't receiving any traffic unless we restart Filebeats everywhere.
@joshuaspence
You can work around this issue by specifying the DNS name multiple times, e.g. for us the chain is Filebeat -> Logstash -> Elasticsearch.
Filebeat config would contain:

```yaml
hosts: ["internal.logstash.XXX.co.uk", "internal.logstash.XXX.co.uk", "internal.logstash.XXX.co.uk"]
```

Logstash contains:

```
elasticsearch {
  hosts => ["{{elasticsearch_host}}", "{{elasticsearch_host}}", "{{elasticsearch_host}}", "{{elasticsearch_host}}"]
  index => "%{component}-%{+YYYY-MM-dd}"
}
```
This allows both Filebeat and Logstash to renew their TCP connections. We can then autoscale Logstash and scale up Elasticsearch when needed without seeing any CPU issues on the services.
@mazurio Do you have loadbalance: true configured on the filebeat?
@maiconio Any news on that pull request? This seems like a well thought out feature and suits well with ELB and similar LBs.
Hi @Bizzelicious,
It's set to:
```
output:
  logstash:
    bulk_max_size: 100
    timeout: 15
    hosts: {{hosts}}
    port: 5044
    loadbalance: true
```
+1 for this feature, as sometimes there is a need to re-establish the connection based on a global DNS load balancing change.
To modify my previous statement,
One possible solution is to set a TCP connection timeout on the load balancer (which might actually be the better solution!).
Since we strive to support load balancing on our own within beats, I agree that that code should probably manage this sort of pain. In particular, if as part of the endpoint health checks, we could check if DNS is still valid, that would be stellar.
Would it be a fair assumption that if DNS changes, then we should abandon an old connection and re-establish? (As part of supporting load balancing behaviour).
So to summarize my opinion:
@PhaedrusTheGreek
Not all load balancers can kill an established connection. AWS ELB and ALB can only time out idle connections, for example.
@Bizzelicious noted. And technically RST'ing a connection is not as clean as client software deciding that a connection has been used for long enough, closing it cleanly, and re-establishing.
Since connections from filebeat to logstash are sticky, when an instance joins the load balancer, it does not get an equal distribution
Equal distribution of connections does not mean equal distribution of load. A single TCP connection costs about the same as 10,000 TCP connections. The data flowing over a connection is the load for Beats.
A load balancer is unlikely to effectively balance load for beats because the "load" (the thing which consumes cpu/network resource) in beats is an event, and there is no general correlation between load and connection count. 10,000 beats connections could be idle and not sending any data; 1 beat could be sending 100,000 events per second.
Beats supports a loadbalance mode that will try to distribute events across multiple connections to multiple Logstash endpoints. When new connections are needed, Beats will do a dns lookup (every time) and new servers appearing in the DNS entry will be considered for connection.
Beats output to Logstash, as described above, was designed to not need a load balancer at all in order to rendezvous with multiple Logstash servers. It's OK if you want to use a load balancer for the operational ability to add/remove backends, though (instead of using DNS for this purpose, for example).
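For illustration, the no-load-balancer setup I am describing looks roughly like this (hostnames are placeholders):

```yaml
output:
  logstash:
    # List the Logstash endpoints directly; beats balances events across them
    # and performs a fresh DNS lookup whenever a new connection is made.
    hosts: ["logstash-1.example.com:5044", "logstash-2.example.com:5044", "logstash-3.example.com:5044"]
    loadbalance: true
```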
Some questions. If you add a new server to your loadbalancer backend:
TTL on a connection is a solution that focuses only on the idea that the connection itself is the load, and as said, it is not.
Some talking points I like to mention with respect to load balancing and data streams:
The way I see the proposal, it's hoping that a TTL to destroy healthy connections will make Beats balance load. Destroying healthy connections will result in Beats retransmitting, which causes duplicates to end up in Logstash and Elasticsearch.
Stepping back, I see the problem basically this: Users have a load balancer, they want to add and remove backends, and they want beats to make use of new servers as they are added to the load balancer.
I propose altering the solution proposed:
Standard Load Balancing && (TCP Connection Count != Load)
Although TCP connections don't constitute load, distributing TCP connections will generally give you distributed load the majority of the time. This may be a moot point, however, because it's agreed that Beats can be a more precise load balancer.
P.S. Good load balancers will determine the true load, not just the TCP connection count, in order to determine where to route the incoming TCP connection - but either way, a new TCP connection is required.
Global Load Balancing (DNS Load Balancing)
Additionally, there is a case where Beats is sending data to Logstash behind 2 load balancers that are active/inactive based on DNS via a global load balancer. In this case, distribution of load is not the consideration, but awareness of the site's "active/failed" status. In this case, should beats have to be restarted? If not, then should connections with invalid DNS be closed somehow?
@PhaedrusTheGreek Well, a failure (timeout, network interrupt, etc.) will cause beats to need a new connection, which does a DNS lookup. If you're saying a "failure" is a situation which is undetectable, then there would be no way for beats to determine that a new connection is needed. It feels like the assumption here is that the "old" entries in the DNS or load balancer are still healthy and working? If so, why would a healthy entry be removed from your DNS/balancer pool?
I still don't think I understand the problem. Here's my pitch:
- the beats protocol requires acknowledgements from Logstash that it has accepted a set of events.
- if no acknowledgement occurs within a given time period, beats assumes Logstash to be dead or stuck, and it will pick a new connection to retransmit and use.
With the above, load will be distributed automatically in such a way that no single Logstash is overloaded. If one Logstash becomes overloaded, many beats will detect this (through acknowledgement timeouts) and will try alternative Logstash servers.
Fair distribution of load (N servers, each server receiving exactly 1/N of the data) sounds nice, but I'm not sure how this helps more than the current model? The current model is this: If you are capable of processing M event rate per server, and you have N servers, you can do a maximum of N*M event rate. If several beats end up trying to send more than M rate to a single server, then some of those beats will timeout and reconnect to another server. This ensures the fleet of beats will cooperate by detecting when a server is overloaded (timeouts) and distributing to other servers.
distributing TCP connections will generally give you distributed load the majority of the time
This is not true. It may be true for some users, but it is not true in general.
Good load balancers will determine the true load
With this definition, I have never experienced a good load balancer. Certainly none are likely to exist that are aware of what constitutes load for the beats network protocol. Beats itself is capable of doing this data distribution without the need for any middleware.
Maybe we can approach this from another angle:
Currently the only answer to any of the above seems to be - restart beats!
Do we support DNS load balancing? If so, what constitutes support?
I will run through the code and diagram what happens. This is a good question to have a nice answer for :)
what do we recommend for a re-distribution workaround?
I personally do not recommend load balancers for the purposes of distributing connections because beats (and logstash-forwarder before it) already does this.
If it's helpful operationally to use a load balancer to allow adding/removing of backends, then I think this is useful. For distributing load, however, I do not think any general load balancer is going to be effective, because my experience is that load balancers assume one connection is one unit of load, which, as said, is not correct for beats.
Dropping healthy connections when DNS is invalid?
Why would a user want a healthy connection to be destroyed?
I still don't think I understand the problem. Here's my pitch:
- the beats protocol requires acknowledgements from Logstash that it has accepted a set of events.
- if no acknowledgement occurs within a given time period, beats assumes Logstash to be dead or stuck, and it will pick a new connection to retransmit and use.
With the above, load will be distributed automatically in such a way that no single Logstash is overloaded. If one Logstash becomes overloaded, many beats will detect this (through acknowledgement timeouts) and will try alternative Logstash servers.
I think we're missing some system-end-to-end view in the protocol itself regarding active 'transactions' between beats and logstash.
There is a difference between timeout and timeout, so to say. Here is the catch: the partial ACK acts like a keep-alive. That is, Logstash will send ACK 0 every 5 seconds while a batch is in progress. This kind of keep-alive is used by Logstash to signal that it is still actively processing the batch. It acts as a back-pressure signal. The keep-alive is used to fight the old LSF issue of resending a big batch of events over and over again due to continuous timeouts caused by a) Logstash dealing with back-pressure, b) crazy grok patterns on bad input (slow filters), c) too big a batch to be processable in time (here we've got some slow-start windowing support, but it's more of a hack). If no keep-alive is received from Logstash in time, the connection is assumed to be broken and beats reconnects and resends. This is clearly a change in behavior between LSF and beats.
Pro:
Beats simply don't know if Logstash is stuck or is waiting for outputs (back-pressure). Beats can not cancel transactions (remove events from LS queues) either. On the other hand, LS doesn't know whether beats uses load balancing or not. That is, it can not just suppress the ACK 0 signal. The current behaviour is more of a trade-off between beats killing downstream systems and potential error detection, based on the very localized view beats and Logstash have of each other.
Given our preferred use case is hundreds of beats sending to a low number of LS servers, in the load-balancing scenario I can see two modes (currently supported by beats/LS):
1) Every beat connects to one LS instance only (random if configured in beats, round-robin with a DNS-based load balancer). The disadvantage is that connections and load might not be well distributed. Plus, with LS sending keep-alive signals, connections won't balance out over time. For LSF, connections might balance out due to the timeout being effective, but the main problem is this: LSF/beats can not really tell the difference in downstream state and act accordingly (wait or reconnect?).
2) Every beat connects to all known LS instances. The load balancer in beats uses a shared work queue. Slow endpoints will receive fewer events than faster endpoints. This way the overall system throughput has a chance to balance out. Only if all LS instances are stuck will the full system grind to a halt. This is basically how Kafka operates: just have all producers send data to all brokers.
Having beats send to all known LS instances also adds the advantage of beats/LS not having to distinguish between sources of LS slow-downs. We just don't care. A slower/blocked/unavailable LS instance will just not receive any more data.
Not yet possible, but we might still have to update beats to 'cancel' a batch in order to forward it to another endpoint if some timeout hits. The connection will be closed and beats reconnects to the unavailable/slow node after some timeout. This way, a stuck node can not fully block e.g. Filebeat, which requires an ACK before it can continue. The main difference is, this full-ack-timeout is only used if:
The problem is with users wanting to add/remove LS instances on the fly. It's kind of nightmarish to handle/configure right now. Some bootstrapping + regular connection updates by beats would be great. Potential solutions (I'd make them pluggable/configurable within beats) could be:
Note: From the AWS ELB docs it seems ELB acts more like a TCP proxy than being DNS based (with the option of internet-facing nodes). I'm not sure right now if one can query all available IPs behind the load balancer in all cases.
Dropping healthy connections when DNS is invalid? Why would a user want a healthy connection to be destroyed?
Good question. Reasons I can think of:
Consider a new feature for the "load balancing" mode in beats such that a user can say "Connect to this endpoint N times", in the hopes that connecting N times to the same load balancer will result in multiple connections to different backends (although, with a load balancer, such guarantees of different backend connections are not necessarily possible).
This is already supported by beats, but given M beats connecting to N LS instances, plus simple round-robin load balancing, this most likely will not have the desired effect if multiple beats are started at the same time.
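E.g. something along these lines (the endpoint name is a placeholder); with many beats starting at the same time, the balancer's round-robin may still place several of those connections on the same backend:

```yaml
output:
  logstash:
    # The same load-balanced endpoint listed three times opens three connections
    # to it; whether they land on different backends is up to the balancer.
    hosts: ["logstash-lb.example.com:5044", "logstash-lb.example.com:5044", "logstash-lb.example.com:5044"]
    loadbalance: true
```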
Fair distribution of load (N servers, each server receiving exactly 1/N of the data) sounds nice, but I'm not sure how this helps more than the current model? The current model is this: If you are capable of processing M event rate per server, and you have N servers, you can do a maximum of N*M event rate. If several beats end up trying to send more than M rate to a single server, then some of those beats will timeout and reconnect to another server. This ensures the fleet of beats will cooperate by detecting when a server is overloaded (timeouts) and distributing to other servers.
As already mentioned, the timeout+reconnect case is not applicable, due to the keep-alive signal from LS. Load balancing is not just about throughput; it also simplifies error handling from beats' point of view, as beats can still progress if one node becomes unavailable or is stuck. As we can't figure out the root cause of why a Logstash instance is stuck, we can hardly make an informed decision about forcing a reconnect to another instance or not. Plus, we don't know whether this would relieve the situation. In the latter case, reconnects could amplify the overload situation.
About idle connections:
Callidryas: "A great enhancement would be to provide a time to live (TTL) config value to specify how long a TCP connection from Filebeat to Logstash should live."
This is a great idea!
We had a similar "issue" in our web service application (which uses "Connection: keep-alive") with GCP load balancers. We implemented a CXF interceptor that every x requests (configurable) sends the request with the HTTP header "Connection: close".
This works great; the load is now distributed evenly.
So I'd assume the idea proposed by @Callidryas will work.
I tried to get it working with this TTL field, but it simply is not working. See #7824
For anyone finding this issue: it's been fixed with the ttl setting - https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html#_ttl
Can this be closed now?
The issue is still open, because there is a limitation on the `ttl` setting requiring users to also set `pipelining: 0`. Without explicitly setting `pipelining`, the `ttl` setting will have no effect.
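For anyone landing here, a minimal example of the combination being described (the hostname is a placeholder; check the docs for your version before relying on it):

```yaml
output.logstash:
  hosts: ["logstash-lb.example.com:5044"]
  ttl: 60s
  # ttl only takes effect on the synchronous client, so pipelining
  # has to be disabled explicitly:
  pipelining: 0
```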
Is there any update? We are running logstash 7.3 in AWS behind ELB (5 instances) and only 2 instances are taking traffic. This is really causing an issue for us (log delay)
Pinging @elastic/integrations-services (Team:Services)
This ticket is approaching 5 years old. My company is in pain with the problem of beats going silent when we use a single DNS name in output.logstash.hosts pointing to an AWS load balancer that sits in front of a fleet of container-based Logstash nodes. We have ttl: 120s and pipelining: 0 and it is insufficient (beats still go silent).
This post mentions disabling the DNS cache. It is interesting, and I have heard nothing from Elastic or this ticket about that, so I am sharing it. I am not yet sure whether this is a viable approach for us.
Thoughts?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
+1 on this. We are also experiencing the same problem where Beats never close their persistent sessions, hence getting stuck with the Logstash server they originally connected to, unless we force kill their TCP sessions on Logstash.
We have configured the Beat clients to load balance between two different FQDNs, each belonging to a different data center (e.g. US-East and US-West). Each data center represents 30+ Logstash servers behind a load balancer.
```yaml
output.logstash:
  hosts: ["us-east-1.company.com", "us-west-1.company.com"]
  loadbalance: true
  ttl: 60s
```
We are seeing connections switching between the data centers with a ~50-50% load, but they always end up glued to the same Logstash server in each region. We want to be able to periodically rebalance system load on the Logstash servers by having the Beats re-establish their sessions, so the load balancer at each data center gets a chance to route them to the least busy Logstash server. What is happening instead is that the Beats seem to keep both sessions alive at all times while load balancing between the two. We also added the TTL setting above, to no avail.
For load balancing, we have tried a) script to remove busy Logstash servers from DNS response based on average system load and JVM heap; or b) using a Keepalived load balancer in front of Logstash with Least Connection LB algorithm. In both cases, we end up with sessions distributed unevenly across our Logstash servers, causing logs to queue up on some servers while others are sitting idle.
Our plan right now is to force-kill TCP sessions on busy Logstash servers so the load balancers can do their job when the Beats try to re-establish their sessions, but that's not a very elegant solution.
Is it possible to add a setting to have the Beats periodically try to reestablish their sessions? Or perhaps adding an option in Logstash Beats input to gracefully close active sessions after a certain period of time?
@kkojouri You need to set `pipelining: 0`. See urso's comment here: https://github.com/elastic/beats/issues/661#issuecomment-553433704
Docs:
The "ttl" option is not yet supported on an async Logstash client (one with the "pipelining" option set).
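In other words, adapting the config you posted would look roughly like this (untested, so treat it as a sketch):

```yaml
output.logstash:
  hosts: ["us-east-1.company.com", "us-west-1.company.com"]
  loadbalance: true
  ttl: 60s
  # Without this, the async (pipelined) client is used and ttl is ignored:
  pipelining: 0
```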
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as `Stale` to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
:+1:
A great enhancement would be to provide a time to live (TTL) config value to specify how long a TCP connection from Filebeat to Logstash should live. After that TTL, a new connection would be created. This would allow better distribution to Logstash instances behind a load balancer.
Since connections from filebeat to logstash are sticky, when an instance joins the load balancer, it does not get an equal distribution. For example, if there are 4 instances behind a load balancer and 3 of them are rebooted, then all filebeat connections will go to the single instance that was not rebooted. I am not aware of any other events, aside from failure, that would cause the connection to be reestablished to a potentially different server.
By specifying a TTL on the connection, there is an opportunity for the load balancer to distribute connections equally between the instances. Using a load balancer tends to be more convenient than updating the configuration of all filebeat clients in an environment.