elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Remove automatic unenrollment after 7 Fleet authentication failures #5428

Open cmacknz opened 2 months ago

cmacknz commented 2 months ago

Today, Elastic Agent unenrolls itself automatically after receiving 7 consecutive 401 responses from Fleet when checking in. This was done to prevent agents that have been force unenrolled (which revokes their API key) from checking in continuously until they are re-installed.

https://github.com/elastic/elastic-agent/blob/590c506aea6f278200d024e65d0bc7e1c8b5238a/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L26-L28

https://github.com/elastic/elastic-agent/blob/590c506aea6f278200d024e65d0bc7e1c8b5238a/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L360-L363

https://github.com/elastic/elastic-agent/blob/590c506aea6f278200d024e65d0bc7e1c8b5238a/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L329-L341

This prevents force unenrolled agents from continuing to contact Fleet Server, but the same automatic unenrollment can also be triggered as an edge case in disaster recovery situations. To eliminate the chance that users recovering their cluster need to manually intervene on machines, we should stop unenrolling and instead greatly increase the checkin interval.

The initial proposal is that instead of unenrolling, we should switch to checking in once per hour. A successful checkin must return the agent to its original checkin interval.

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)