elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Reconsider the timeout value of connection_idle_timeout / idle_connection_timeout #40749

Open toby-sutor opened 2 months ago

toby-sutor commented 2 months ago

Describe the enhancement: With version 8.12 we changed the default connection timeout from Beats/Agent to Elasticsearch from 60 seconds to 3 seconds. As a result, Agents have to reconnect to Elasticsearch more frequently, which can flood DNS servers with lookup requests. In most cases this does not seem to be an issue when a local DNS resolver is installed on the OS. However, in some environments that is not the default or desired setup, leading to unexpectedly high numbers of network requests. It is therefore questionable whether closing connections after three seconds is sensible, given that this is supposed to be an HTTP connection where users would expect an active keep-alive. A more balanced value in the range of 10-30 seconds might be a better compromise for the default.
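For context, a minimal standalone Beats output configuration touching the setting in question might look like the sketch below (the host is a placeholder; for Fleet-managed Agents the equivalent settings would typically go into the output's advanced YAML configuration):

```yaml
# filebeat.yml -- sketch only; host is a placeholder
output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]
  # Since 8.12 the effective default idle timeout is 3s, so idle connections
  # are closed and later re-established -- including a fresh DNS lookup --
  # after 3 seconds without traffic. This request proposes a larger default.
  idle_connection_timeout: 3s
```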

References:

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 1 month ago

Thanks, the defaults are a compromise across many use cases. If we get more reports about this we can re-evaluate the value.

If you haven't already, you can change this: https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#_preset

The latency preset is approximately what the old defaults were. Depending on your data volume, you may be better off with the throughput preset. Or you could just change idle_connection_timeout by itself if you haven't already (this requires not setting a preset, or using preset: custom).
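For example, a sketch of the two options (the host and the 30s value are placeholders, not recommendations):

```yaml
output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]
  # Option A: approximate the pre-8.12 behaviour via the preset
  # preset: latency
  # Option B: override only the idle timeout; this requires either not
  # setting a preset at all or explicitly selecting the custom preset
  preset: custom
  idle_connection_timeout: 30s
```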

strawgate commented 1 month ago

The balanced preset intentionally avoids keep-alive, as each Agent would otherwise keep 6-8 active connections to Elasticsearch in normal use. As @cmacknz mentioned, this is a trade-off across use cases. In low-throughput use cases, the balanced setting should result in <1 DNS query every 2 seconds in most cases, which I wouldn't consider excessive.

If the customer has a large number of low-throughput Agents, they may find that the scale preset is more appropriate for their use case, as Agents send data less often and thus perform fewer requests (even though the idle timeout is lower), and by extension fewer DNS requests. Alternatively, they can follow @cmacknz's recommendation and customize the settings as needed.

Similarly, for medium- to high-throughput clients, the throughput preset ensures that the connection is kept alive in most cases (though it also increases the worker count and maximum memory consumption).
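Roughly, the preset choice per scenario could be sketched as follows (preset names are from the linked documentation; the scenario labels and host are illustrative):

```yaml
output.elasticsearch:
  hosts: ["https://elasticsearch.example.com:9200"]
  # Many low-throughput Agents: batch more, send less often, fewer DNS lookups
  preset: scale
  # Medium- to high-throughput Agents: keep connections busy enough to stay
  # open, at the cost of more workers and higher memory use
  # preset: throughput
```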