elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Support longer checkin intervals when the agent status has not changed #2257

Open joshdover opened 1 year ago

joshdover commented 1 year ago

We've been doing scale testing over the past few months using a ~30 minute long poll duration (rather than the current default of 5m), and we are seeing much better results for very large clusters.

We're now ready to make this the default setting for Fleet Server and Agent. These changes can happen independently and do not necessarily need to land in the same release, though that would be preferred. The corresponding Fleet Server changes are tracked in:

There is some additional complexity to changing this on the Agent side, as we currently have an issue where Agent will not re-checkin with Fleet Server when its health status changes. If we update the long polling interval to 30 minutes, this could result in the agent status in the UI being up to 30 minutes stale, rather than only 5 minutes stale.

To avoid this kind of regression, we need to update Agent to also cancel the current checkin and start a new one when its status changes. However, we will cap the frequency of these status-triggered checkins to once every 5 minutes to avoid any extra load on large Fleets. We will investigate letting Agent report status changes more frequently separately from this change, see https://github.com/elastic/elastic-agent/issues/1946.

Tasks

The client-side timeout in Agent should be longer than both the Fleet Server timeout (28m) and the proxy's timeout (30m 20s). We'll keep a similar buffer here, roughly 5 minutes over the proxy's timeout, so the client will time out at 35 minutes.

michel-laterman commented 1 year ago

How will the action queue for scheduled actions be checked with a longer poll time?

EDIT: I've added a separate timer to dispatch scheduled actions in a managed agent https://github.com/elastic/elastic-agent/pull/2344

cmacknz commented 1 year ago

Changed the description to "Support longer checkin intervals when the agent status has not changed" since we aren't going to increase the default timeout when this issue closes.

pchila commented 1 year ago

after a quick clarification with @cmacknz :

cmacknz commented 1 year ago

I added the timeout configuration migration to a separate issue in https://github.com/elastic/elastic-agent/issues/2597 for tracking.

Also created https://github.com/elastic/elastic-agent/issues/2598 to track updating horde to use the new checkin parameter in requests with a 7m timeout.

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)