Open toby-sutor opened 2 months ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Thanks, the defaults are a compromise across many use cases. If we get more reports of this, we can re-evaluate the value.
If you haven't already, you can change this: https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#_preset
The `latency` preset is approximately what the old defaults were. Depending on your data volume, you may be better off with the `throughput` preset. Or you could just change `idle_connection_timeout` by itself if you haven't already (this requires not setting a preset or using `preset: custom`).
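As a rough sketch of the last option, the fragment below overrides only the idle timeout. It assumes a standalone Filebeat `filebeat.yml`; the host URL is a placeholder, and `30s` is just the upper end of the range suggested in the issue description, not a recommended value.

```yaml
output.elasticsearch:
  # Placeholder host, replace with your own Elasticsearch endpoint.
  hosts: ["https://elasticsearch.example.internal:9200"]
  # Per the comment above, overriding an individual setting requires
  # either no preset or the custom preset.
  preset: custom
  # Keep idle connections open longer so that fewer reconnects
  # (and DNS lookups) are needed. Illustrative value only.
  idle_connection_timeout: 30s
```

Other output settings are left out here and would need to be reviewed separately, since no preset is applying values on your behalf.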
The balanced preset intentionally avoids keep-alive, as each Agent would otherwise be keeping 6-8 active connections to Elasticsearch in normal use. As @cmacknz mentioned, this is a trade-off across use cases. In low-throughput deployments, the balanced setting should result in <1 DNS query every 2 seconds in most cases, which I wouldn't consider excessive.
If the customer has a large number of low-throughput Agents, they may find that the `scale` preset is more appropriate for their use case, as Agents send data less often and thus perform fewer requests (even though the idle timeout is lower), and by extension fewer DNS requests. Or they can follow @cmacknz's recommendation and customize the settings as needed.
Similarly, if they are medium- to high-throughput clients, the `throughput` preset ensures that the connection is kept alive in most cases (though it also increases worker count and maximum memory consumption).
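For reference, switching presets is a one-line change. This is a minimal sketch, again assuming a standalone Filebeat `filebeat.yml` with a placeholder host; which preset fits depends on the fleet size and data volume discussed above.

```yaml
output.elasticsearch:
  # Placeholder host, replace with your own Elasticsearch endpoint.
  hosts: ["https://elasticsearch.example.internal:9200"]
  # Many low-throughput Agents: scale.
  # Medium- to high-throughput clients: throughput.
  preset: scale
```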
Describe the enhancement: With version 8.12, we changed the default idle connection timeout from Beats/Agent to Elasticsearch from 60 to 3 seconds. As a result, Agents have to reconnect to Elasticsearch more frequently, which can lead to situations where the DNS servers get spammed. In most cases, this does not seem to be an issue when a local DNS server is installed on the OS. However, in some scenarios this is not the default or not desired, leading to an unexpectedly high number of network requests. As such, it is questionable whether closing connections after three seconds is reasonable, given that this is an HTTP connection where users would expect an active keep-alive. A more balanced value like 10-30 seconds might be a better compromise for the default.
References: