Reduce load and log noise when agents are force unenrolled

joshdover commented 2 years ago

When agents are "force" unenrolled, they can continue to attempt to check in with Fleet Server using invalid API keys. Each check-in attempt by these agents can produce a lot of noise in the Fleet Server and Elasticsearch logs. This is a common scenario when running Agent inside VMs and containers, where instances may be reverted to a snapshot or spun back up after being force unenrolled. These instances create constant error logs and load in Elasticsearch when attempting to validate these invalidated keys.

Steps to reproduce this scenario:

1. Enroll an agent in Fleet on a VM
2. Shut down or suspend the VM
3. From the Fleet UI, unenroll the agent with the "remove now" checkbox selected
4. Start up or resume the VM
5. Observe apikey failed logs in Fleet Server and Elasticsearch

Potential solutions to this problem (not mutually exclusive or exhaustive):

Return a more informative error message back to Elastic Agent that will stop the agent from executing / continuing to check in with Fleet Server
Cache invalidated API keys in memory to avoid the need to check with Elasticsearch
Throttle logging for repeated invalid API key check-ins

joshdover commented 2 years ago

Related to this is a request to surface details in the UI about unenrolled agents that are attempting to check in: https://github.com/elastic/kibana/issues/132702

ph commented 2 years ago

> Return a more informative error message back to Elastic Agent that will stop the agent from executing / continuing to check in with Fleet Server

I would prefer not to change the behavior of the Elastic Agent loops; I'd assume this state could be temporary depending on the setup.

> Cache invalidated API keys in memory to avoid the need to check with Elasticsearch

This would not work in the context of multiple Fleet Server instances.

> Throttle logging for repeated invalid API key check-ins.

I will make a PR to reduce the number of logs generated for this. I still think we need to see something in the log for now, but we don't need to see it for every agent that tries to connect.
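
To make the throttling idea concrete, here is a minimal, language-agnostic sketch in Python (Fleet Server itself is written in Go, so this is not its code). `ThrottledLogger`, its parameters, and the `"invalid-api-key"` key are hypothetical names chosen for illustration. It emits at most one line per key per interval and reports how many repeats were suppressed.

```python
import time
from typing import Callable, Dict


class ThrottledLogger:
    """Emit at most one log line per key per interval and count suppressed repeats.

    Hypothetical helper for illustration; not part of Fleet Server.
    """

    def __init__(
        self,
        emit: Callable[[str], None],
        interval_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._emit = emit              # sink for log lines, e.g. print or logger.warning
        self._interval_s = interval_s  # minimum spacing between emitted lines per key
        self._clock = clock            # injectable so the behavior can be tested
        self._last_emit: Dict[str, float] = {}
        self._suppressed: Dict[str, int] = {}

    def log(self, key: str, message: str) -> bool:
        """Log `message` unless `key` was logged within the interval.

        Returns True if a line was emitted, False if it was suppressed.
        """
        now = self._clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self._interval_s:
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        # Report how many repeats were dropped since the last emitted line.
        skipped = self._suppressed.pop(key, 0)
        suffix = f" ({skipped} similar messages suppressed)" if skipped else ""
        self._emit(message + suffix)
        self._last_emit[key] = now
        return True
```

Using one shared key such as `"invalid-api-key"` for all agents keeps a single periodic line in the log instead of one per check-in. Keying by agent or API key ID would give more detail, but the internal maps would then grow with the number of distinct keys.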

ph commented 2 years ago

Oh, I see: this warning is actually coming from the ES layer. In that case we could indeed have a cache that keeps track of failed API keys for a period of time, maybe 1 hour?
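
A time-limited cache of failed API keys could look roughly like the sketch below. It is written in Python for brevity and is not Fleet Server code. `InvalidKeyCache`, `authenticate`, and `validate_with_es` are hypothetical names, and `validate_with_es` stands in for whatever call actually validates a key against Elasticsearch. The cache lives in one process, so each Fleet Server instance would keep its own copy. The 1 hour TTL and the size cap are assumptions.

```python
import time
from typing import Callable, Dict


class InvalidKeyCache:
    """In-memory negative cache of API key IDs that recently failed validation."""

    def __init__(
        self,
        ttl_s: float = 3600.0,       # assumed: remember failures for 1 hour
        max_entries: int = 10_000,   # assumed cap so the cache cannot grow unbounded
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._ttl_s = ttl_s
        self._max_entries = max_entries
        self._clock = clock
        # Insertion-ordered, so the first key is always the oldest entry.
        self._expires_at: Dict[str, float] = {}

    def mark_invalid(self, key_id: str) -> None:
        # Re-insert so the entry moves to the newest position.
        self._expires_at.pop(key_id, None)
        while len(self._expires_at) >= self._max_entries:
            oldest = next(iter(self._expires_at))
            del self._expires_at[oldest]
        self._expires_at[key_id] = self._clock() + self._ttl_s

    def is_known_invalid(self, key_id: str) -> bool:
        expires = self._expires_at.get(key_id)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Entry expired: forget it so the key is validated again.
            del self._expires_at[key_id]
            return False
        return True


def authenticate(
    key_id: str,
    cache: InvalidKeyCache,
    validate_with_es: Callable[[str], bool],
) -> bool:
    """Reject known-bad keys locally; otherwise ask the validator and cache failures."""
    if cache.is_known_invalid(key_id):
        return False
    if validate_with_es(key_id):
        return True
    cache.mark_invalid(key_id)
    return False
```

With this shape, a force-unenrolled agent that keeps checking in triggers one validation call per TTL window per process rather than one per check-in. Only failures are cached, so a key that later becomes valid again is rejected for at most one TTL.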

elastic / fleet-server

Reduce load and log noise when agents are force unenrolled #1496