aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Automated Node Health Checking #928

Open marjamis opened 4 years ago

marjamis commented 4 years ago


Tell us about your request This request is for automated health management of EKS worker nodes within a cluster: additional checks to ensure each node is an active and healthy member of the cluster, followed by remediation steps when it is not. For example, monitoring node conditions for abnormalities and, if one is detected, taking an action such as notifying admins and/or performing a step to try to fix the issue.

Primarily this is for EKS Managed Nodes, but preferably it would be something that could be extended to self-managed nodes as well.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, when there are nodes in a cluster, the Kubernetes control plane performs health checking/remediation for Pods, Deployments, etc., and the underlying EC2 instances are health checked via the ASG EC2 health check. However, there is a missing link: checking the health of the node with respect to its role within Kubernetes itself. For example, there are no integrated or automated checks for when the kubelet dies and can't be restarted, or for other node conditions that report an issue that would upset the health of the node within Kubernetes but not necessarily trigger an EC2 health check failure.

The goal is to have an automated system for EKS, possibly integrated with the ASG, that performs a Kubernetes-level health check of a node, is configurable, and performs custom actions when an issue occurs. For example, if a worker node has been in the NotReady state for 10 consecutive minutes, deem this an issue, report it, and then automatically perform an action, such as having the ASG terminate the instance due to the failing health check and replace the node. Other actions could also be made available, such as trying to self-heal the node rather than terminating it outright.
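For illustration, a minimal sketch of such a check built with the kubernetes and boto3 Python clients (assumptions: kubeconfig access, nodes whose spec.providerID ends in the EC2 instance ID, and IAM permission for autoscaling:SetInstanceHealth):

```python
# Sketch: replace EKS nodes that have been NotReady for too long.
from datetime import datetime, timedelta, timezone

import boto3
from kubernetes import client, config

NOT_READY_THRESHOLD = timedelta(minutes=10)

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()
autoscaling = boto3.client("autoscaling")

for node in v1.list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    if ready.status == "True":
        continue
    # lastTransitionTime marks when the node left the Ready state.
    not_ready_for = datetime.now(timezone.utc) - ready.last_transition_time
    if not_ready_for < NOT_READY_THRESHOLD:
        continue
    # EKS sets spec.providerID to aws:///<az>/<instance-id>.
    instance_id = node.spec.provider_id.rsplit("/", 1)[-1]
    print(f"{node.metadata.name} NotReady for {not_ready_for}; marking {instance_id} Unhealthy")
    # The ASG health check then terminates and replaces the instance.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
```

Marking the instance Unhealthy makes its ASG terminate and replace it, which is the remediation described above; a gentler remedy (cordon/drain, reboot) could be substituted.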

Implementing this today would require creating a slew of additional components, either inside or outside of the cluster, to monitor, track, and perform actions. While possible, this is additional work that could be automated as part of EKS/AWS.

Many configuration and action possibilities could arise out of this feature request, but the key component is being able to perform an action on a worker node that is not properly functioning as a node of the Kubernetes cluster.

Are you currently working around this issue? Generally, tracking when issues occur and then manually intervening, such as terminating the EC2 instance so the ASG replaces it.
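The manual replacement step is essentially the following (a sketch assuming boto3 and a hypothetical instance ID):

```python
# Terminate a bad instance and let its ASG launch a replacement.
import boto3

boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0",      # hypothetical bad node's instance ID
    ShouldDecrementDesiredCapacity=False,  # keep capacity so a replacement launches
)
```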

There are tools, such as https://github.com/kubernetes/node-problem-detector and its associated remedy systems (https://github.com/kubernetes/node-problem-detector#remedy-systems), that could be run now, but they require additional implementation work and aren't directly integrated with EKS/AWS tooling/monitoring, such as for EKS Managed Nodes.

Additional context None

Attachments None

aaron-trout commented 3 years ago

This would be similar to the GKE "auto repair" feature on node pools, I guess? I have seen a couple of issues lately where a node dies in a way that does not get handled gracefully; i.e. the node/pods are unreachable/dead but k8s does not reschedule the pods on other nodes. So very keen for this feature!

stevehipwell commented 3 years ago

@aaron-trout not that it's health checking, but all EKS versions from v1.19 should have kubelet improvements that prevent it from getting stuck, unable to reconnect to the API server after an error. If you're still seeing "stuck" nodes you might also want to look at your kubelet reserved-resource configuration; the EKS defaults are very low for ENI-based pod allocation and are even lower when prefix mode is used to increase pod density.
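For comparison, here is a sketch of the GKE-style reservation formula that the EKS AL2 bootstrap script also uses; treat the exact constants as assumptions and verify against your AMI version:

```python
# Sketch of the GKE-style kubelet reservation formula, which the EKS AL2
# bootstrap script also uses; constants are assumptions, verify per AMI.

def kube_reserved_memory_mib(max_pods: int) -> int:
    # 255 MiB base plus 11 MiB per pod slot.
    return 255 + 11 * max_pods

def kube_reserved_cpu_millicores(vcpus: int) -> float:
    # 6% of the first core, 1% of the second, 0.5% of cores 3-4,
    # 0.25% of every core above four.
    millicores = 0.0
    for core in range(1, vcpus + 1):
        if core == 1:
            millicores += 60
        elif core == 2:
            millicores += 10
        elif core in (3, 4):
            millicores += 5
        else:
            millicores += 2.5
    return millicores

# Example: m5.large with ENI-based allocation (2 vCPUs, 29 max pods)
print(kube_reserved_memory_mib(29))      # 574 MiB
print(kube_reserved_cpu_millicores(2))   # 70.0 millicores
```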

jbilliau-rcd commented 2 years ago

We have this issue as well; nodes will die for whatever reason (kubelet), and a lot of folks don't like to, and in most cases don't have the access to, terminate the node in the AWS console so it gets replaced. If EKS had a way of verifying the health of the kubelet and, if unhealthy, terminating and replacing the node, that would be fantastic.
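For what it's worth, the check itself is cheap: the kubelet serves a healthz endpoint on localhost (port 10248 by default). A sketch in Python, assuming the requests package:

```python
# Sketch: probe the kubelet's local healthz endpoint (default port 10248).
import requests

def kubelet_healthy(timeout: float = 2.0) -> bool:
    try:
        resp = requests.get("http://127.0.0.1:10248/healthz", timeout=timeout)
    except requests.RequestException:
        return False
    # A healthy kubelet answers 200 with the body "ok".
    return resp.status_code == 200
```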

shay-berman commented 1 year ago

Any update here? Does EKS not know how to auto-repair worker nodes automatically? (I believe GKE and AKS support that, so I wonder when this will be implemented in EKS?)

armujahid commented 11 months ago

Noticed this issue again: a managed node group Bottlerocket node was stuck in Unknown status (for more than 6 hours) because of the kubelet crashing issue, and I had to manually restart the kubelet. EKS failed to self-heal/replace that node in this case, although everything was rescheduled somewhere else thanks to Karpenter, which I was using. Details are posted here: https://github.com/bottlerocket-os/bottlerocket/issues/2512#issuecomment-1880056535

camaeel commented 11 months ago

You could give https://github.com/dbschenker/node-undertaker/ a try to handle such cases.

marianafranco commented 10 months ago

EKS recommends running node-problem-detector with some remedy system to drain and terminate the bad node: https://aws.github.io/aws-eks-best-practices/reliability/docs/dataplane/#run-node-problem-detector

It would be great if node-problem-detector could be provided out of the box, or at least as an EKS add-on. This is already the case with GKE and AKS.
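In the meantime, wiring node-problem-detector up yourself is mostly a matter of its custom-plugin protocol: NPD runs an executable and maps exit code 0 to healthy, 1 to a problem, and 2 to unknown, with stdout as the condition message. A hypothetical plugin script (in Python here; shell is more common) checking the kubelet's local healthz endpoint:

```python
#!/usr/bin/env python3
# Hypothetical node-problem-detector custom plugin. NPD interprets the
# exit code (0 = healthy, 1 = problem, 2 = unknown) and uses stdout as
# the condition message.
import sys

import requests

try:
    resp = requests.get("http://127.0.0.1:10248/healthz", timeout=2)
except requests.RequestException as exc:
    print(f"kubelet healthz unreachable: {exc}")
    sys.exit(1)

if resp.status_code == 200:
    print("kubelet is healthy")
    sys.exit(0)

print(f"kubelet healthz returned HTTP {resp.status_code}")
sys.exit(1)
```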

camaeel commented 10 months ago

I think cluster-autoscaler has one limitation: if nodes are in an ASG that has min=max, it won't terminate nodes that are not working properly. This case is addressed by https://github.com/dbschenker/node-undertaker/

jw-maynard commented 9 months ago

None of the workarounds here address an issue I ran into recently. We had a scale-from-zero node group set up to handle intermittent CronJobs. The node group went to scale the ASG from 0 -> 1; a node was created and passed the EC2 health checks, but kubelet was crashing so the node never checked into the cluster. While I'm aware that a custom lifecycle hook could theoretically detect this, it seems like something the node group should be able to detect automatically: the ASG says one node is ready but the node group can't see the node in Kubernetes, so after some amount of time, kill the node and let the ASG start a new one.
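A sketch of the detection half, assuming boto3, the kubernetes Python client, and a hypothetical ASG name; registered nodes are matched to instances via spec.providerID:

```python
# Sketch: flag ASG instances that never registered as cluster nodes.
import boto3
from kubernetes import client, config

ASG_NAME = "my-nodegroup-asg"  # hypothetical ASG name

config.load_kube_config()
v1 = client.CoreV1Api()
autoscaling = boto3.client("autoscaling")

# Instance IDs of every node the cluster knows about.
registered = {n.spec.provider_id.rsplit("/", 1)[-1] for n in v1.list_node().items}

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]

for inst in group["Instances"]:
    if inst["LifecycleState"] == "InService" and inst["InstanceId"] not in registered:
        # After a join-grace period, terminating here would let the ASG
        # launch a replacement (termination call omitted in this sketch).
        print(f"{inst['InstanceId']} is InService but never joined the cluster")
```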

kevincantu commented 5 months ago

Yeah, I want some kind of custom health check for EKS node health on the instance itself, so if one has failed to connect to EKS, it can mark itself Unhealthy and let the ASG replace it.
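That piece can be scripted on the node today; a sketch assuming IMDSv2 is reachable, the instance role allows autoscaling:SetInstanceHealth, and some local check (such as the kubelet healthz probe sketched earlier) has already failed:

```python
# Sketch: a node marking itself Unhealthy so its ASG replaces it.
import boto3
import requests

IMDS = "http://169.254.169.254"

# IMDSv2: fetch a session token, then this instance's ID.
token = requests.put(
    f"{IMDS}/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    timeout=2,
).text
instance_id = requests.get(
    f"{IMDS}/latest/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
    timeout=2,
).text

# Fail the ASG health check so the instance is terminated and replaced.
boto3.client("autoscaling").set_instance_health(
    InstanceId=instance_id,
    HealthStatus="Unhealthy",
    ShouldRespectGracePeriod=False,
)
```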