Open BenB196 opened 2 years ago
Hi, thanks for opening this issue and your proposals.
I would like to have a better understanding of the two use cases you described, and understand where it would make sense to improve things (in Agent, in ECK, or both). Could you provide the steps to reproduce them?
More specifically:
[...] one of the Elastic Agent Fleet Server pods that was receiving traffic was having its credentials rotated and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.
How were the Agent credentials rotated?
If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to actually be sure that the changes you are making are working as intended.
Could you explain what a "deployment" is in this context?
Thanks
Hi @barkbay,
Regarding:
[...] one of the Elastic Agent Fleet Server pods that was receiving traffic was having its credentials rotated and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.
How were the Agent credentials rotated?
I believe this is being done by the ECK operator (I could be wrong, though). I came to this conclusion as follows:
Note: Only the Fleet server was having this issue, Beats and the Fleet Agents could still communicate and authenticate with Elasticsearch, so this led me to believe this wasn't an Elasticsearch issue.
If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to actually be sure that the changes you are making are working as intended.
Could you explain what a "deployment" is in this context?
"Deployment" in this context refers to any Kubernetes workload: Deployment, DaemonSet, CronJob, StatefulSet, Pod, etc. (basically, anywhere that Kubernetes surfaces the "health" or status of the Agent). As soon as you add a new Elastic Agent CRD to a cluster and the ECK operator applies it, the only time the "deployment" is in anything other than a healthy state is during the image pull. Once the image has been pulled and the underlying pod has spun up, the pod is marked as healthy, even if the Agent is not "healthy".
Proposal
By default, Elastic Agent CRDs don't have any health checks (Readiness, Liveness, Startup). While this makes some sense, I think there is room for at least some documentation improvements (if not other improvements as well).
Why does this matter?
If you don't have any checks, then Kubernetes will always just assume that the pod is healthy and therefore route traffic to it.
Use Case 1: Bad Pod in the Bunch
I recently noticed this as an issue on one deployment where a majority of my Elastic Agents would go unhealthy for a good period of time (~15-20 minutes). After investigating further, I determined the issue to be that one of the Elastic Agent Fleet Server pods that was receiving traffic was having its credentials rotated and therefore wasn't really in a "healthy" state, yet it was showing as healthy in Kubernetes.
This led me to discover that there are in fact no default health checks for Elastic Agents.
Use Case 2: Upgrades/New Deployment
If you are performing any sort of upgrade on a current deployment, or are creating a new deployment, you want to actually be sure that the changes you are making are working as intended. Currently, with the defaults, if you were to do an upgrade or a new deployment, it will always be "healthy" from the Kubernetes perspective. This could be bad if the user doesn't know that there are no default health checks, thinks that everything is actually healthy, and continues about their day, until at some point they realize that this is not the case.
Enhancement 1
In the documentation for the Elastic Agent (Fleet managed + Standalone), add a note/section stating that the user needs to define their own health checks. While this might seem somewhat obvious, it caught me somewhat off guard, as a lot of the other resources that the ECK operator manages come with default checks, so I figured why wouldn't the Elastic Agent?
Enhancement 2
Document some good starting/base checks that people can build from.
While I understand that the Elastic Agent is a somewhat hard thing to get right when it comes to checks, as things can be very dynamic (policies change, ports are user-configurable, etc.) and "healthy" might vary depending on this, I think that providing at least some example checks would help new users, who may not be as familiar with Kubernetes, at least get started with a good/"healthy" configuration.
Enhancement 3
Provide at least a basic default readiness check. The Elastic Agent has a way of checking its health:
/opt/Elastic/Agent/elastic-agent status
(it may also have an API? I'm not too sure, though). It would be nice if there was a basic, default readiness check that looked at the health of the agent and made sure that it was actually "healthy".
Example command:
It can be used as a way of telling that an Agent is in a healthy state and ready for use.
Explanation of the command:
.Status == 2
Agent is in a healthy state
(.Applications | length > 0)
(Explained in more detail in Enhancement 4), but basically just checks to make sure at least one integration (application) is loaded into the agent so that it can do "something".
Note: There are some gaps here in the Elastic Agent status command where not all issues are properly propagated to show a 100% correct agent health, but this would at least cover most cases.
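For illustration only, the two readiness conditions described above can be sketched as a small Python function. This is not the original example command. It assumes the status command can emit JSON with a top-level numeric Status field and an Applications list, as the expressions above imply; the function name is_ready and the choice to treat unparseable output as "not ready" are assumptions made for this sketch.

```python
import json

HEALTHY = 2  # status code the explanation above associates with a healthy agent


def is_ready(status_json: str) -> bool:
    """Return True only if the agent is healthy and has at least one application.

    Mirrors: .Status == 2 and (.Applications | length > 0)
    """
    try:
        status = json.loads(status_json)
    except json.JSONDecodeError:
        # Assumption for this sketch: output we cannot parse means "not ready".
        return False
    if not isinstance(status, dict):
        return False

    applications = status.get("Applications")
    has_applications = isinstance(applications, list) and len(applications) > 0
    return status.get("Status") == HEALTHY and has_applications
```

In a pod, the same logic would more likely live in an exec-style readiness probe that runs the status command inside the container; the Python form here only makes the decision rule explicit and easy to test.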
Enhancement 4
This requires a somewhat deeper understanding of how the Elastic Agent can "break", but in theory, you can also provide a default liveness check to attempt to "fix" a broken pod.
Example command:
It can be used as a way of telling the health of an agent.
Explanation of the command:
.Status == 3
Agent is in a degraded state
.Status == 4
Agent is in a failed state
.Status == 2 and (.Applications | length == 0)
Agent is in a healthy state, but has no integrations
Note: There are some gaps here in the Elastic Agent status command where not all issues are properly propagated to show a 100% correct agent health, but this would at least cover most cases.
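As a rough sketch of the three liveness conditions listed above (and not the original example command), the rule could be written in Python as follows. It again assumes JSON status output with top-level Status and Applications fields; the name should_restart is hypothetical, and not restarting on unparseable output is a conservative choice made for this sketch rather than anything the issue specifies.

```python
import json

# Status codes as described in the explanation above.
HEALTHY, DEGRADED, FAILED = 2, 3, 4


def should_restart(status_json: str) -> bool:
    """Return True when the agent looks broken enough to fail a liveness check.

    Mirrors: .Status == 3, .Status == 4,
             or .Status == 2 and (.Applications | length == 0)
    """
    try:
        status = json.loads(status_json)
    except json.JSONDecodeError:
        # Assumption for this sketch: do not restart on output we cannot parse.
        return False
    if not isinstance(status, dict):
        return False

    code = status.get("Status")
    applications = status.get("Applications")
    app_count = len(applications) if isinstance(applications, list) else 0

    if code in (DEGRADED, FAILED):
        return True
    return code == HEALTHY and app_count == 0
```

Because the third condition also matches an agent that is healthy but has not loaded any integration yet, a real liveness probe built on this rule would probably need a generous initial delay so that a freshly started pod is not restarted prematurely.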
Enhancement 5
This is a somewhat vague one as I'm not 100% sure, but a default startup check could potentially be provided here as well. Though it might be similar enough to the readiness check I provided that it might not be needed.