
[Enhancement] Improvements Around Elastic Agent Health Checks #5376

Open BenB196 opened 2 years ago

BenB196 commented 2 years ago

Proposal

By default, Elastic Agent CRDs don't define any health checks (readiness, liveness, startup). While this makes some sense, I think there is room for at least some documentation improvements (if not other improvements as well).

Why does this matter?

If a pod doesn't have any checks, Kubernetes will always assume that it is healthy and therefore route traffic to it.

Use Case 1: Bad Pod in the Bunch

I recently noticed this as an issue on one deployment where a majority of my Elastic Agents would go unhealthy for a significant period of time (~15-20 minutes). After investigating further, I determined that one of the Elastic Agent Fleet Server pods receiving traffic was having its credentials rotated, and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.

This led me to discover that there are in fact no default health checks for Elastic Agents.

Use Case 2: Upgrades/New Deployment

If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to be sure that the changes you are making actually work as intended. Currently, with the defaults, an upgrade or a new deployment will always be "healthy" from the Kubernetes perspective. This can be bad if users don't know that there are no default health checks: they may think everything is actually healthy and continue about their day, until at some point they realize that this is not the case.

Enhancement 1

In the documentation for the Elastic Agent (Fleet managed + standalone), add a note/section stating that users need to define their own health checks. While this might seem somewhat obvious, it caught me off guard: many of the other resources that the ECK operator manages come with default checks, so I assumed the Elastic Agent would too. A sketch of where such checks would go follows.
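For illustration, the docs could show where such checks go in the Agent resource. A minimal sketch, assuming a Deployment-based Fleet Server, assuming that elastic-agent status exits non-zero when the agent is unhealthy, and with placeholder names and timings (not a tested recommendation):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.1.0
  mode: fleet
  fleetServerEnabled: true
  # elasticsearchRefs / kibanaRef elided for brevity
  deployment:
    replicas: 1
    podTemplate:
      spec:
        containers:
          - name: agent
            # ECK sets no probes by default; without this block the pod
            # is marked Ready as soon as the container starts.
            readinessProbe:
              exec:
                command:
                  - /bin/sh
                  - -c
                  # Assumption: non-zero exit code when unhealthy.
                  - /opt/Elastic/Agent/elastic-agent status
              initialDelaySeconds: 10
              periodSeconds: 30
```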

Enhancement 2

Document some good starting/base checks that people can build from.

While I understand that checks for the Elastic Agent are hard to get right, as things can be very dynamic (policies changing, ports being user-configurable, etc.) and "healthy" might vary accordingly, I think providing at least some example checks would help new users who may not be as familiar with Kubernetes get started with a good/"healthy" configuration.

Enhancement 3

Provide at least a basic default readiness check. The Elastic Agent has a way of checking its health, /opt/Elastic/Agent/elastic-agent status (it may also have an API, though I'm not sure). It would be nice if there was a basic default readiness check that looked at the health of the agent and made sure that it was actually "healthy".

Example command:

/opt/Elastic/Agent/elastic-agent status --output json | jq -e '.Status == 2 and (.Applications | length > 0)'

It can be used as a way of telling that an Agent is in a healthy state and ready for use (see the probe sketch after the note below).

Explanation of the command:

  1. .Status == 2 Agent is in a healthy state
  2. (.Applications | length > 0) checks that at least one integration (application) is loaded into the agent, so that it can actually do "something" (explained in more detail in Enhancement 4).

Note: There are some gaps in the elastic-agent status command, where not all issues are properly propagated, so it won't reflect agent health 100% correctly, but this would at least cover most cases.
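Wired into a probe, that could look something like the following (a sketch only: it assumes jq is available in the agent image, which may not be the case, and the timing values are arbitrary):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # jq -e exits 0 only when the expression is true, i.e. the agent
      # reports healthy (Status == 2) and at least one application is
      # loaded. Assumes jq exists in the image.
      - >-
        /opt/Elastic/Agent/elastic-agent status --output json |
        jq -e '.Status == 2 and (.Applications | length > 0)'
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```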

Enhancement 4

This one requires a slightly deeper understanding of how the Elastic Agent can "break", but in theory a default liveness check could also be provided to attempt to "fix" a broken pod by restarting it.

Example command:

/opt/Elastic/Agent/elastic-agent status --output json | jq -e '.Status == 3 or .Status == 4 or (.Status == 2 and (.Applications | length == 0))'

It can be used as a way of detecting a broken agent (the command exits 0 when the agent is in one of the bad states below).

Explanation of the command:

  1. .Status == 3 Agent is in a degraded state
    • This can happen for a number of reasons.
  2. .Status == 4 Agent is in a failed state
    • This can happen if an integration misses 2+ check-ins.
  3. .Status == 2 and (.Applications | length == 0) Agent is in a healthy state, but has no integrations.
    • This is a weird one, which I think is mainly an Agent bug, but it is also a sign of a broken Agent. It only happens with Fleet managed agents: an agent can be in a healthy state yet have no integrations (applications) loaded. In the real world, the likelihood of this being intentional is extremely low, so the majority of the time it will be a sign of a "broken" agent. (Though, depending on timing, you could end up in a restart loop if the liveness check fires faster than the agent can apply any pending config.)

Note: There are some gaps in the elastic-agent status command, where not all issues are properly propagated, so it won't reflect agent health 100% correctly, but this would at least cover most cases.
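One wrinkle if this were wired into a livenessProbe: Kubernetes restarts the container when the probe command exits non-zero, so the "broken" detection above has to be negated so that the probe succeeds while the agent is fine. A sketch, again assuming jq is available in the image and with arbitrary timings:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # The detection condition from above, negated with "| not": the
      # probe exits 0 (pass) unless the agent is degraded, failed, or
      # healthy-but-empty, in which case Kubernetes restarts the pod.
      # Assumes jq exists in the image.
      - >-
        /opt/Elastic/Agent/elastic-agent status --output json |
        jq -e '(.Status == 3 or .Status == 4 or (.Status == 2 and (.Applications | length == 0))) | not'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 5
```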

Enhancement 5

This is a somewhat vague one, as I'm not 100% sure, but a default startup check could potentially be provided as well. Though it might be similar enough to the readiness check I proposed above that it might not be needed.
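If one were added, it could probably just reuse the readiness command with a generous failure threshold, so the agent has time to enroll and apply its initial policy before the readiness/liveness probes take over. A sketch with arbitrary thresholds, again assuming jq is present in the image:

```yaml
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Same condition as the readiness sketch; allows up to
      # 30 * 10s = 300s for the agent to come up before restarting.
      - >-
        /opt/Elastic/Agent/elastic-agent status --output json |
        jq -e '.Status == 2 and (.Applications | length > 0)'
  periodSeconds: 10
  failureThreshold: 30
```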

barkbay commented 2 years ago

Hi, thanks for opening this issue and your proposals.

I would like to have a better understanding of the 2 use cases you described, and understand where it would make sense to improve things (in Agent, in ECK, or both). Could you provide the steps to reproduce them?

More specifically:

[...] one of the Elastic Agent Fleet Server pods that was receiving traffic was having its credentials rotated and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.

How have the Agent credentials been rotated?

If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to actually be sure that the changes you are making are working as intended.

Could you explain what a "deployment" is in this context?

Thanks

BenB196 commented 2 years ago

Hi @barkbay,

Regarding:

[...] one of the Elastic Agent Fleet Server pods that was receiving traffic was having its credentials rotated and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.

How have the Agent credentials been rotated?

I believe this is being done by the ECK operator (I could be wrong though). The way I came to this conclusion is the following:

  1. Saw a large portion of Fleet managed agents show as offline in Kibana
  2. I inspected one of the "offline" agents at the source and saw that the Fleet server it was trying to connect to was not responding correctly.
  3. I investigated the "active" Fleet server (the fleet server which Kubernetes was routing traffic to) and saw in the fleet server logs that it was getting a sort of unauthorized error when trying to connect to Elasticsearch.
  4. After ~10-15 minutes the issue resolved itself. This led me to conclude that, most likely, the credentials the Fleet Server was using had expired (something along the lines of an API key expiring), or that the ECK operator was in the middle of rotating credentials.
    • Either way, the problem resolved itself without any intervention from me, so I'm assuming the ECK operator sorted it out (unless I'm missing some fundamental part of how the Fleet setup works).

Note: Only the Fleet Server was having this issue; Beats and the Fleet-managed Agents could still communicate and authenticate with Elasticsearch, which led me to believe this wasn't an Elasticsearch issue.

If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to actually be sure that the changes you are making are working as intended.

Could you explain what a "deployment" is in this context?

"Deployment" in this context refers to a Kubernetes; Deployment, Daemonset, Cronjob, Stateful Set, Pod, etc... (Basically, anywhere that Kubernetes surfaces the "health" or status of the Agent). Where as soon as you add a new Elastic Agent CRD to a cluster and the ECK operator applies it, the only time the "deployment" is in anything other than a healthy state is during the Image Pull process, as soon as the image has been pulled and the underlying pod spun up, the health of the pod is marked as healthy, even if the Agent is not "healthy".