Support longer checkin intervals when the agent status has not changed

joshdover commented 1 year ago

We've been doing scale testing over the past few months using a ~30 minute long poll duration (rather than the current default of 5m), and we are seeing much better results for very large clusters.

We're now ready to make this the default setting for Fleet Server and Agent. These changes can happen independently and do not necessarily need to land in the same release, though that would be preferred. The corresponding Fleet Server changes are tracked in:

https://github.com/elastic/fleet-server/issues/2337

There is some additional complexity to changing this on the Agent side: we currently have an issue where Agent does not re-checkin with Fleet Server when its health status changes. If we raise the long polling interval to 30 minutes, the agent status in the UI could be up to 30 minutes stale, rather than at most 5 minutes stale.

To avoid this kind of regression, we need to update Agent to cancel the current checkin and start a new one when its status changes. We will limit these status-triggered checkins to at most one every 5 minutes to avoid any extra load on large Fleets. Increasing how often Agent reports status changes beyond that will be investigated separately from this change, see https://github.com/elastic/elastic-agent/issues/1946.
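To make the cancel-and-restart behaviour concrete, here is a minimal sketch in Go. It is not the Agent implementation: `checkinLoop`, `remainingDebounce`, `debounceWindow`, the `statusChanged` channel and the `checkin` callback are all hypothetical names, and the sketch assumes `checkin` returns promptly once its context is cancelled.

```go
package main

import (
	"context"
	"time"
)

// debounceWindow is the 5 minute cap described above (hypothetical name).
const debounceWindow = 5 * time.Minute

// remainingDebounce returns how much longer to wait before a status change
// may interrupt a checkin that has already been running for elapsed.
func remainingDebounce(elapsed, window time.Duration) time.Duration {
	if elapsed >= window {
		return 0
	}
	return window - elapsed
}

// checkinLoop runs long-poll checkins back to back. A signal on statusChanged
// cancels the in-flight checkin, but only once debounceWindow has passed
// since that checkin started.
func checkinLoop(ctx context.Context, statusChanged <-chan struct{}, checkin func(context.Context)) {
	for ctx.Err() == nil {
		pollCtx, cancel := context.WithCancel(ctx)
		started := time.Now()
		done := make(chan struct{})
		go func() {
			defer close(done)
			checkin(pollCtx) // blocks until the server answers or pollCtx is cancelled
		}()

		select {
		case <-done: // the long poll returned on its own
		case <-ctx.Done(): // the agent is shutting down
		case <-statusChanged:
			// Wait out the rest of the debounce window, unless the poll
			// finishes or the agent shuts down first.
			timer := time.NewTimer(remainingDebounce(time.Since(started), debounceWindow))
			select {
			case <-timer.C:
			case <-done:
			case <-ctx.Done():
			}
			timer.Stop()
		}
		cancel() // no-op if the checkin already returned
		<-done   // never leave a checkin goroutine behind
	}
}
```

With this shape a status change that arrives 1 minute into a checkin is reported about 4 minutes later, and one that arrives after the window has passed is reported immediately.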

Tasks

The client-side timeout in Agent should be longer than both Fleet Server's timeout (28m) and the proxy's timeout (30m 20s). We'll keep a similar buffer here, roughly 5 minutes over the proxy's timeout, and time out at 35 minutes on the client.

[ ] Add the ability to cancel a checkin and start a new one when the status changes, with a 5 minute debounce
[ ] When there is a request error during checkin, the log message should link to the troubleshooting guide page
[ ] Update default for fleet.timeout to 35 minutes: https://github.com/elastic/elastic-agent/blob/c0976970f2317dd79a9056ea1ee2044b72485713/internal/pkg/remote/config.go#L49
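
The relationship between the three timeouts in this section can be written down as a small check. This is only a sketch: the constant names and `checkClientTimeout` are hypothetical and do not come from the Agent or Fleet Server code, and the values are the ones proposed in this issue.

```go
package main

import (
	"fmt"
	"time"
)

// Values proposed in this issue; the names are illustrative only.
const (
	fleetServerTimeout = 28 * time.Minute
	proxyTimeout       = 30*time.Minute + 20*time.Second
	clientTimeout      = 35 * time.Minute
)

// checkClientTimeout verifies that the client gives up last, so a long poll
// is always ended by Fleet Server or the proxy rather than by the Agent.
func checkClientTimeout(client, server, proxy time.Duration) error {
	if client <= server {
		return fmt.Errorf("client timeout %s must exceed Fleet Server timeout %s", client, server)
	}
	if client <= proxy {
		return fmt.Errorf("client timeout %s must exceed proxy timeout %s", client, proxy)
	}
	return nil
}
```

Calling `checkClientTimeout(clientTimeout, fleetServerTimeout, proxyTimeout)` returns nil for the proposed values, while any client timeout at or below 30m 20s is rejected.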

michel-laterman commented 1 year ago

How will the action queue for scheduled actions be checked with a longer poll time?

EDIT: I've added a separate timer to dispatch scheduled actions in a managed agent https://github.com/elastic/elastic-agent/pull/2344
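
For readers who have not looked at the PR, the general idea of a timer that runs independently of the checkin loop can be sketched as follows. This is not the code from the PR: `scheduledAction`, `splitDue` and `dispatchLoop` are hypothetical names chosen for illustration.

```go
package main

import "time"

// scheduledAction is a hypothetical stand-in for a queued action.
type scheduledAction struct {
	ID      string
	StartAt time.Time
}

// splitDue partitions queue into actions whose start time has been reached
// and actions that are still pending.
func splitDue(queue []scheduledAction, now time.Time) (due, pending []scheduledAction) {
	for _, a := range queue {
		if !a.StartAt.After(now) {
			due = append(due, a)
		} else {
			pending = append(pending, a)
		}
	}
	return due, pending
}

// dispatchLoop checks the queue on every tick, independent of how long a
// checkin takes, and returns when the queue is empty or stop is closed.
// interval must be positive, since time.NewTicker panics otherwise.
func dispatchLoop(interval time.Duration, stop <-chan struct{}, queue []scheduledAction, dispatch func(scheduledAction)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for len(queue) > 0 {
		select {
		case <-stop:
			return
		case now := <-ticker.C:
			var due []scheduledAction
			due, queue = splitDue(queue, now)
			for _, a := range due {
				dispatch(a)
			}
		}
	}
}
```

Because the ticker is decoupled from the long poll, a scheduled action fires within one tick of its start time even if a checkin stays open for 30 minutes.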

cmacknz commented 1 year ago

Changed the description to "Support longer checkin intervals when the agent status has not changed" since we aren't going to increase the default timeout when this issue closes.

pchila commented 1 year ago

After a quick clarification with @cmacknz:

This change will not be in 8.8; we are targeting 8.9.
We need to add a migration for agents < 8.9 that updates the old default timeout value to 7 minutes, in order to have checkin intervals of ~5 minutes
We need to set the elastic agent state debounce to a value of 7 minutes to avoid a race between fleet server and elastic agent at the end of a long poll
We need to be able to migrate the debounce value on upgrade as well.
Debounce settings should not be part of the fleet.enc in the first implementation (to avoid the problem of an older default value overriding a new one)
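
One way to read the migration requirement is sketched below. The old default is deliberately left as a parameter because this thread does not state its value, and `migrateTimeout` and `migratedTimeout` are hypothetical names. Leaving non-default values untouched is an assumption of this sketch, not something decided above, and the caller is assumed to have already established that the upgrade is from an agent < 8.9.

```go
package main

import "time"

// migratedTimeout is the 7 minute value mentioned above (hypothetical name).
const migratedTimeout = 7 * time.Minute

// migrateTimeout returns the timeout to persist after an upgrade. A stored
// value equal to the old default is replaced by the new one; any other value
// is assumed to be a deliberate user setting and is kept as is.
func migrateTimeout(stored, oldDefault time.Duration) time.Duration {
	if stored == oldDefault {
		return migratedTimeout
	}
	return stored
}
```

The same comparison against the previous default would apply to the debounce value, which is why persisting a default in fleet.enc makes it hard to tell an untouched default from an explicit setting.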

cmacknz commented 1 year ago

I added the timeout configuration migration to a separate issue in https://github.com/elastic/elastic-agent/issues/2597 for tracking.

Also created https://github.com/elastic/elastic-agent/issues/2598 to track updating horde to use the new checkin parameter in requests with a 7m timeout.

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elastic / elastic-agent

Support longer checkin intervals when the agent status has not changed #2257
