elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Detect when active API key count in security index does not align with enrolled agents #189071

Open kpollich opened 2 months ago

kpollich commented 2 months ago

In general, Fleet provisions two API keys for each enrolled Elastic Agent. We should have a check that fires during Fleet setup to ensure that the count of active API keys in Elasticsearch aligns with the count of agents enrolled.

If this check fails, we should display a warning callout at the top of the Fleet app that links to a troubleshooting guide for issues with agent API keys.
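A minimal sketch of what such a check could look like (the function and type names are hypothetical, not actual Fleet code; the only grounded assumption is the one stated above, that Fleet provisions two API keys per enrolled agent):

```typescript
// Hypothetical sketch of the proposed setup-time consistency check.
// Names are illustrative, not real Fleet/Kibana APIs.

interface ApiKeyCountCheckResult {
  expectedKeyCount: number;
  activeKeyCount: number;
  isConsistent: boolean;
}

// Fleet provisions two API keys for each enrolled Elastic Agent,
// so the expected active key count is twice the enrolled agent count.
function checkApiKeyCounts(
  enrolledAgentCount: number,
  activeKeyCount: number
): ApiKeyCountCheckResult {
  const expectedKeyCount = enrolledAgentCount * 2;
  return {
    expectedKeyCount,
    activeKeyCount,
    isConsistent: activeKeyCount === expectedKeyCount,
  };
}
```

When `isConsistent` is false, Fleet setup would surface the warning callout linking to the troubleshooting guide.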

elasticmachine commented 2 months ago

Pinging @elastic/fleet (Team:Fleet)

jlind23 commented 2 months ago

@kpollich thanks for creating this. @nimarezainia to chime in here too.

The main problem here stems from the fact that a user can delete or otherwise tamper with their security index, which will eventually result in an escalation on our end. What options do we have to inform/warn users as much as possible in order to avoid such a scenario? This warning message is one option, but it will not work for users who exclusively use APIs to perform actions.

Any other ideas / suggestions?

cmacknz commented 2 months ago

What situations would actually cause this? If this only happens if someone manually manipulates the security index, I think this is treating the symptom and not the cause.

Why would someone manually interact with the security index? Were they trying to revoke an API key (this is force unenroll), rotate agent API keys (this is a missing Fleet feature), simulate a disaster recovery scenario?

I think it should be obvious that if you start blowing away parts of Fleet's internal state without going through Fleet itself, you are going to be in trouble. If we can't actually prevent this, or provide a UX for the thing a user was trying to accomplish, let's focus instead on having a dedicated recovery path for this situation.

If you were to delete the security index and all of the agent API keys, for example, you'd be left with a collection of enrolled agents you want to keep but that can't authenticate with Fleet. We could, for example, have a re-enroll endpoint that preserves the agent ID but grants a new API key, subject to having to do this from the host the agent is running on.

jlind23 commented 2 months ago

> If you were to delete the security index and all of the agent API keys, for example, you'd be left with a collection of enrolled agents you want to keep but that can't authenticate with Fleet. We could, for example, have a re-enroll endpoint that preserves the agent ID but grants a new API key, subject to having to do this from the host the agent is running on.

But this would mean that you would have to clean up the agent IDs that you don't want to see appearing again, otherwise they'll eventually be able to re-enroll.

> Why would someone manually interact with the security index? Were they trying to revoke an API key (this is force unenroll), rotate agent API keys (this is a missing Fleet feature), simulate a disaster recovery scenario?

The security index is not a Fleet-specific index, so there might be other reasons why users want to interact with it, but they should not be touching it manually in any case.

cmacknz commented 2 months ago

> But this would mean that you would have to clean up the agent IDs that you don't want to see appearing again, otherwise they'll eventually be able to re-enroll.

Which is why we can consider building a way to have agents enroll again but preserve their existing agent IDs, so that a user in this situation doesn't have to do this cleanup at all.
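A rough sketch of that recovery path, under stated assumptions (all types and names below are invented for illustration; this is not an existing Fleet API): a re-enroll handler that looks up the existing agent record by ID, issues a fresh API key while keeping the agent ID intact, and rejects agents that were explicitly unenrolled so no manual ID cleanup is needed.

```typescript
// Hypothetical sketch of a re-enroll path that preserves agent IDs.
// Types and function names are illustrative, not real Fleet APIs.

interface AgentRecord {
  id: string;
  status: 'online' | 'offline' | 'unenrolled';
  accessApiKeyId?: string;
}

interface ReenrollResult {
  ok: boolean;
  reason?: string;
  newApiKeyId?: string;
}

function reenrollAgent(
  agents: Map<string, AgentRecord>,
  agentId: string,
  issueApiKey: () => string
): ReenrollResult {
  const agent = agents.get(agentId);
  if (!agent) {
    return { ok: false, reason: 'unknown agent id' };
  }
  // Explicitly unenrolled agents must not come back; otherwise users
  // would have to clean up stale agent IDs by hand.
  if (agent.status === 'unenrolled') {
    return { ok: false, reason: 'agent was unenrolled' };
  }
  // Grant a fresh API key but keep the same agent ID.
  const newApiKeyId = issueApiKey();
  agent.accessApiKeyId = newApiKeyId;
  return { ok: true, newApiKeyId };
}
```

The unenrolled-status guard is what addresses the cleanup concern raised earlier in the thread.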

nimarezainia commented 1 month ago

If the check runs during Fleet setup, it may not actually catch this error condition, in particular the user problem being referenced. Agreed that users shouldn't be playing around with internal indices like this.

Could we instead make something like this check part of a diagnostics/self-repair flow, where the user realizes something is not working and can trigger an action to discover the problem? We could then enhance this to add a self-repair option. So it would be user-initiated rather than some sort of cron job that runs regularly (think 100k agents).

We can expand this over time to include other self-repair actions: things like issuing a reset to a stuck agent, or testing the connection to an output, etc.

(Although it might just be best to report these errors in the UI, e.g. on the agent details page: if the API keys are invalid, if the configured output is not reachable, etc.)
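One way to read this suggestion is as a small registry of user-triggered diagnostic checks that can grow over time. A hypothetical sketch (all names invented; not an existing Kibana/Fleet API):

```typescript
// Hypothetical sketch of a user-initiated diagnostics registry;
// names are illustrative, not real Kibana/Fleet code.

interface DiagnosticResult {
  name: string;
  healthy: boolean;
  detail: string;
}

type DiagnosticCheck = () => DiagnosticResult;

const checks: DiagnosticCheck[] = [];

// New self-repair/diagnostic actions register themselves here,
// e.g. an API key count check, an output connectivity test, etc.
function registerCheck(check: DiagnosticCheck): void {
  checks.push(check);
}

// Runs only when the user asks for it (say, from an agent details
// page), avoiding a recurring scheduled job across e.g. 100k agents.
function runDiagnostics(): DiagnosticResult[] {
  return checks.map((check) => check());
}
```

The point of the registry shape is extensibility: each future self-repair action is just another registered check.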

jlind23 commented 1 month ago

> Which is why we can consider building a way to have agents enroll again but preserve their existing agent IDs, so that a user in this situation doesn't have to do this cleanup at all.

If we forbid unenrolled agents from enrolling again, then we are good with this.