[EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State

petewilcock commented 3 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

Bug

Which service(s) is this request for?

EKS Managed Add-Ons

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

In a scenario where an EKS cluster has only tainted nodes available, the CoreDNS add-on cannot deploy successfully. Terraform, for example, will time out after 20 minutes of failing to schedule, and the deployment will eventually be recorded as failed in the console.

However, if I begin to deploy the CoreDNS add-on, and then immediately patch the deployment with my tolerations, the service will schedule successfully but the EKS console will report the add-on as 'Degraded' even after the new ReplicaSet has deployed successfully.
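The patch step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the taint `dedicated=system:NoSchedule` is a hypothetical placeholder, the Deployment is assumed to be named `coredns` in `kube-system`, and it is assumed to already carry a tolerations list (the JSON Patch `/-` append fails otherwise).

```shell
#!/bin/sh
# Hypothetical taint (dedicated=system:NoSchedule); replace with your own.
TAINT_KEY="dedicated"
TAINT_VALUE="system"
TAINT_EFFECT="NoSchedule"

# Build a JSON Patch that appends one toleration to the pod template.
# The "/-" suffix appends to the existing tolerations array, so the
# tolerations already on the Deployment are left in place.
build_toleration_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"key":"%s","operator":"Equal","value":"%s","effect":"%s"}}]' \
    "$1" "$2" "$3"
}

# Apply the patch to the coredns Deployment and wait for the rollout.
# Assumes kubectl is already pointed at the affected cluster.
patch_coredns() {
  kubectl -n kube-system patch deployment coredns --type json \
    -p "$(build_toleration_patch "$TAINT_KEY" "$TAINT_VALUE" "$TAINT_EFFECT")"
  kubectl -n kube-system rollout status deployment/coredns --timeout=120s
}

# Print the patch for review; nothing is applied until patch_coredns is called.
build_toleration_patch "$TAINT_KEY" "$TAINT_VALUE" "$TAINT_EFFECT"
```

Running the script only prints the patch; calling `patch_coredns` afterwards applies it. A JSON Patch is used here rather than a merge patch because a merge patch would replace the whole tolerations list instead of appending to it.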

It appears that EKS periodically attempts a rolling redeployment of a degraded add-on, but this just makes the pods unschedulable again and the cycle repeats.

The ultimate fix (and embedded feature request) here would be the ability to stipulate tolerations as part of the add-on configuration.

Are you currently working around this issue?

With difficulty - Terraform sees the degraded Add-on state as tainted and looks to destroy it.

One workaround is to supply an untainted node to allow the initial deployment to stabilize, and then patch it.

Another seems to be to 'catch it in the act' of the rolling redeploy and patch it again whilst it is in progress. This seems to make it stabilize but is painfully manual.
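To check whether either workaround has actually cleared the 'Degraded' state, the add-on status can be polled from the AWS CLI instead of the console. The sketch below is a minimal example, not an official procedure: `my-cluster` is a placeholder cluster name, and the attempt count and 20-second interval are arbitrary choices.

```shell
#!/bin/sh
# Placeholder cluster name; override via the CLUSTER_NAME environment variable.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Fetch the add-on status string (for example ACTIVE or DEGRADED).
addon_status() {
  aws eks describe-addon --cluster-name "$1" --addon-name coredns \
    --query 'addon.status' --output text
}

# Return 0 only when the reported status is ACTIVE.
is_addon_active() {
  [ "$1" = "ACTIVE" ]
}

# Poll until the add-on reports ACTIVE or the attempts run out.
# $1 = cluster name, $2 = number of attempts (defaults to 30).
wait_for_addon() {
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    status="$(addon_status "$1")"
    echo "coredns add-on status: $status"
    if is_addon_active "$status"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 20
  done
  return 1
}
```

Calling `wait_for_addon "$CLUSTER_NAME"` after patching gives a quick signal of whether the add-on ever leaves 'Degraded', which is the behaviour this issue reports as missing.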

Additional context

Attachments

If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

pierluigilenoci commented 3 years ago

We have been hit by the same problem. 😞

Solutions proposed by AWS support:

remove the taints
change taints to PreferNoSchedule
do not use managed addons

Obviously, none of the solutions are really acceptable. 🤦🏻 If someone decides to use taints on the nodes it is because they need them.

pierluigilenoci commented 3 years ago

It appears that CoreDNS has been hit several times by taint-related issues:

And not just CoreDNS:

https://github.com/kubernetes/autoscaler/issues/3802

And not just on AWS EKS:

Unfortunately there is no way around this as the tolerations field is fully managed by EKS. Ref: https://docs.aws.amazon.com/eks/latest/userguide/add-ons-configuration.html

I waited a long time for this feature to be released, and seeing it this limited leaves a bitter taste in my mouth.

stevehipwell commented 3 years ago

I'd really like to see some documentation for the CriticalAddonsOnly taint and its use in EKS; this appears to be undocumented behaviour.

The best reference I can find to it in the Kubernetes codebase is https://github.com/kubernetes/kubernetes/pull/101966#issuecomment-840788605, where it's described as "an artifact of the old addon manager under /cluster" and the commenter adds "i don't think such a taint is actively used anymore".

stevehipwell commented 3 years ago

@pierluigilenoci having EKS install the addons is a bad enough experience; allowing EKS to "manage" the addons is just asking for trouble. I'm very happy that this behaviour wasn't forced on us (see AKS), but I'm desperately waiting on https://github.com/aws/containers-roadmap/issues/923 to be able to provision a bare cluster.

pierluigilenoci commented 3 years ago

Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻 I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

I agree that the documentation is missing but I think it is linked to the fact that the AddOns management is (IMHO) still "not production ready" despite being released. So I think they still don't have a clear direction as to where they want to take the "product".

In reality, both platforms (AKS / EKS) are still a bit immature for my taste. But I think it's an inherent problem with Kubernetes itself that it's a product in the making. And obviously, for those who want to provide a product, it is not easy to keep up. I have to give them credit because in the last 2-3 years they have improved a lot and therefore we have decided to migrate from self-managed clusters to managed clusters.

P.S.: I found this comment enlightening https://github.com/aws/containers-roadmap/issues/923#issuecomment-756183905

stevehipwell commented 3 years ago

> Instead, I'm for the team "the more it is managed by the cloud provider, the better". 💪🏻 I prefer to focus on other things and let AWS / Azure engineers do their work. 😜

@pierluigilenoci you might be interested in https://github.com/aws/containers-roadmap/issues/1559 then?

pierluigilenoci commented 3 years ago

@stevehipwell you've had my like and I look forward to finding you a few more. But I have little hope that AWS will consider the request.

datadoggers commented 1 year ago

If the cluster and node group configuration seems fine, try changing the instance type (for example, if the node group uses t3.medium, try t2.medium or another type), then check whether the CoreDNS pods are running with kubectl get pods -n=kube-system | grep coredns. In my case that solved it.

sftim commented 1 year ago

BTW, CriticalAddonsOnly is not an official Kubernetes taint; if it were, you'd see it listed in Well-Known Labels, Annotations and Taints and it would be prefixed, e.g. with kubernetes.io/.

It's an architectural bug that the cluster-autoscaler tries to use a private taint. Ideally, folks help fix the cluster autoscaler.

AndreiAtMP commented 3 weeks ago

Hello, still hitting this problem with Terraform.

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.name.name
  addon_name   = "coredns"
  depends_on   = [aws_eks_fargate_profile.kube-system]
}

The add-on gets stuck in the Degraded state.

Any updates, please?

Best regards, Andrei

aws / containers-roadmap

[EKS Add-On] [CoreDNS]: Patched Add-On never recovers from 'Degraded' State #1389
