aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] CoreDNS issues after upgrading to 1.29 #2298

Open rjtshrm opened 8 months ago

rjtshrm commented 8 months ago

We recently upgraded from version 1.25 to 1.29. Up until version 1.28, we didn't have any issues. However, after upgrading to version 1.29, our applications suddenly started throwing errors stating that they can't resolve domain names.

Upon inspecting the logs, we discovered that CoreDNS itself was failing: with EKS version 1.29 it cannot reach the Kubernetes API endpoint. We are using Terraform to update the cluster.

Is there something that changed with EKS version 1.29 related to CoreDNS, CNI, or networking in general? Everything worked fine up to version 1.28. Apart from changing the version in Terraform, all configurations remain the same.
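For reference, a quick way to see the failure from inside the cluster (assuming a working kubectl context; the pod name and image tag are arbitrary):

# One-off DNS lookup from a throwaway pod; on the 1.29 cluster this times out
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup kubernetes.default.svc.cluster.local

# Pull the CoreDNS logs (the logs for 1.28 and 1.29 are included below)
kubectl -n kube-system logs deployment/coredns --tail=100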

I have included the CoreDNS logs for both version 1.28 and 1.29. Any help would be greatly appreciated.

CoreDNS logs with EKS 1.28

.:53
xxxxx.lan.:53
xxxxx.internal.:53
xxxxx.internal.:53
xxxxx.cloud.:53
[INFO] plugin/reload: Running configuration SHA512 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
CoreDNS-1.10.1
linux/amd64, go1.21.5, 34742fdd
[INFO] 172.31.141.181:51238 - 25614 "A IN domain1.c.xxxxx.internal. udp 70 false 512" NOERROR qr,rd,ra 138 0.014623223s
[INFO] 172.31.141.181:51238 - 5896 "AAAA IN domain2.c.xxxxx.internal. udp 70 false 512" NOERROR qr,rd,ra 175 2.018777708s
[INFO] 172.31.141.181:38271 - 10889 "A IN domain3.c.xxxxx.internal. udp 70 false 512" NOERROR qr,rd,ra 138 0.013623023s

CoreDNS logs with EKS 1.29

[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
CoreDNS-1.11.1
linux/amd64, go1.21.5, e8fa22a0
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.100.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1010975873]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (26-Feb-2024 15:40:26.542) (total time: 30000ms):
Trace[1010975873]: ---"Objects listed" error:Get "https://10.100.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout 30000ms (15:40:56.543)
Trace[1010975873]: [30.000813318s] [30.000813318s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.100.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.100.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1482381721]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (26-Feb-2024 15:40:26.541) (total time: 30004ms):
Trace[1482381721]: ---"Objects listed" error:Get "https://10.100.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout 30004ms (15:40:56.546)
Trace[1482381721]: [30.004177736s] [30.004177736s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.100.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.100.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[748359720]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (26-Feb-2024 15:40:26.541) (total time: 30005ms):
Trace[748359720]: ---"Objects listed" error:Get "https://10.100.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.100.0.1:443: i/o timeout 30005ms (15:40:56.547)
Trace[748359720]: [30.005153615s] [30.005153615s] END
0xRIZE commented 8 months ago

We have an enterprise support contract with AWS and contacted them when we noticed that the first DNS calls from pods started breaking after 1.29. Long story short, from the control plane logs of the K8s cluster they concluded the following:

Answer: Unfortunately, we won't be able to provide an ETA for the fix. The issue only persists for the initial 60s, which is the startup delay currently affecting the latest Fargate pods. So, the temporary fix for now is to add an init container and have it sleep for 60s.

Answer: According to the troubleshooting done by the internal team, we found that there is a startup delay of approximately 60s for Fargate pods. Unfortunately, I won't be able to provide more information on this, as it is internal only.
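For anyone who wants to try that without editing manifests by hand, a rough sketch of adding the 60s sleep init container to an existing deployment via a JSON patch (the namespace, deployment name, container name, and image are placeholders; untested):

# Add a 60s "sleep" init container so the pod waits out the Fargate startup delay
kubectl -n <namespace> patch deployment <deployment-name> --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/initContainers",
   "value": [{"name": "startup-delay",
              "image": "busybox:1.28",
              "command": ["sh", "-c", "sleep 60"]}]}
]'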

Possibly relates to https://github.com/aws/containers-roadmap/issues/2281 too ..

Hope that helps you ...

veekaly commented 8 months ago

I've encountered the same behavior. Here is a temporary workaround until the EKS team fixes the problem.

Add the initContainer spec below to your deployment/pod manifest. It keeps the pod in the Init state until your desired FQDN resolves.

spec:
  initContainers:
  - name: init-service
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup <FQDN-your-app-is-trying>; do echo waiting for DNS resolution; sleep 2; done"]

Kampe commented 8 months ago

Seeing the same issues here with 1.29

ilpianista commented 7 months ago

Same here with Ubuntu 22.04 EKS AMIs and 1.29 :thinking:

It isn't only DNS. CoreDNS doesn't report any problem, while metrics-server fails to start because it cannot reach the API server endpoint: "https://10.100.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.100.0.1:443: i/o timeout

sameeraksc commented 7 months ago

We also experienced this issue with 1.29. When there is high load, new servers join the cluster and we get these errors from the pods on the new servers. It's making a big impact on our production traffic: production.ERROR: php_network_getaddresses: getaddrinfo for .....cache.amazonaws.com failed: Temporary failure in name resolution. This is a critical issue, they should prioritize this :-(

achevuru commented 7 months ago

@sameeraksc EKS on EC2, or Fargate? The issue referenced above is specific to EKS Fargate and should have nothing to do with high load.

sameeraksc commented 7 months ago

It's Fargate on EKS, @achevuru. Yeah, it has nothing to do with the high load itself. I mean that under high load, deployments in the cluster get auto-scaled, and the newly joined Fargate pods randomly hit this issue.

ilpianista commented 7 months ago

> Same here with Ubuntu 22.04 EKS AMIs and 1.29 🤔

FYI, I cannot reproduce it on Ubuntu 20.04 EKS with 1.29. I've opened https://bugs.launchpad.net/cloud-images/+bug/2060203 to track this on their side.

sameeraksc commented 7 months ago

@ilpianista we are having this issue on Amazon Linux 2 EKS nodes as well.

rjtshrm commented 5 months ago

Since raising this issue I have downgraded my cluster. I wanted to know whether the issue still persists or has been resolved.

sotiriougeorge commented 5 months ago

I have just upgraded my setup.

My workloads come mainly from deploying Helm charts. On a preliminary, cursory examination I don't see any DNS issues, which concerns me given the number of thumbs-up and the amount of participation in this ticket.

Is there a particular use case that has been identified as triggering the DNS issues, or does anyone know whether it has been resolved?

Thank you in advance

philipp-durrer-jarowa commented 5 months ago

Same issues here with an AKS cluster after upgrading to 1.29.4, using a node pool with node image version AKSUbuntu-2204gen2containerd-202405.27.0. Funnily enough, it's only observed on one out of 12 AKS clusters. Any help would be appreciated.

Stopping and starting the affected cluster actually fixed it... /shrug
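For reference, a stop/start cycle of an AKS cluster via the az CLI looks roughly like this (resource group and cluster name are placeholders):

az aks stop --resource-group <resource-group> --name <cluster-name>
az aks start --resource-group <resource-group> --name <cluster-name>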

hitsub2 commented 2 months ago

Same issue here. Any update on this?

thatfatguypat commented 4 weeks ago

I found my issue to be due to the 1.29.x versions of the kube-proxy add-on. Once a CoreDNS pod was removed from rotation, kube-proxy updated all the iptables rules correctly, but an entry in the conntrack table pointing to the old CoreDNS pod's IP was left behind. After reverting kube-proxy to v1.28.8-eksbuild.5 the problems went away.

See this issue for more details. A fix was implemented here and backported to Kubernetes 1.29.9.
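If you want to check what your cluster is running and roll back, a sketch along these lines should work (cluster name and the stale pod IP are placeholders; the conntrack cleanup has to be run on the affected node and needs conntrack-tools installed):

# Check the kube-proxy add-on version currently installed on the cluster
aws eks describe-addon --cluster-name <cluster-name> --addon-name kube-proxy \
  --query 'addon.addonVersion'

# Roll the add-on back to a build without the regression
aws eks update-addon --cluster-name <cluster-name> --addon-name kube-proxy \
  --addon-version v1.28.8-eksbuild.5 --resolve-conflicts OVERWRITE

# On an affected node: drop stale conntrack entries pointing at the removed CoreDNS pod IP
sudo conntrack -D -d <old-coredns-pod-ip> -p udp --dport 53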

JacobHenner commented 2 weeks ago

The EKS documentation lists v1.30.3-eksbuild.5 as the latest version of kube-proxy for 1.30, but this issue isn't fixed for 1.30 until 1.30.6. I see that v1.30.6-eksbuild.1 is available in ECR - should the documentation be updated to recommend this version instead? If not, what action should users affected by this issue take?

This is a fairly significant problem for us, with failures impacting several clusters of ours in noticeable ways. I'm surprised guidance has yet to be published for the issue.
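In the meantime, pinning the add-on manually to the newer build seems like the obvious interim step, something along these lines (cluster name is a placeholder; verify the version is actually offered for your cluster first):

# List the kube-proxy add-on versions published for Kubernetes 1.30
aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.30 \
  --query 'addons[].addonVersions[].addonVersion'

# Pin the add-on to the build that carries the backported fix
aws eks update-addon --cluster-name <cluster-name> --addon-name kube-proxy \
  --addon-version v1.30.6-eksbuild.1 --resolve-conflicts OVERWRITE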