rjtshrm opened 8 months ago
We have an enterprise support contract with AWS and contacted them when we noticed that the first DNS calls from pods started failing after upgrading to 1.29. Long story short, from the control plane logs of the K8s cluster they concluded the following:
Answer: Unfortunately, we won’t be able to provide an ETA for the fix. The issue only persists for the first 60s, which is the initial delay currently affecting the latest Fargate pods. So, for now, the temporary fix is to add an init container and make it sleep for 60s.
Answer: According to the troubleshooting done by the internal team, there is a startup delay of approximately 60s for Fargate pods. Unfortunately, I won’t be able to provide more information on this as it is internal only.
Possibly relates to https://github.com/aws/containers-roadmap/issues/2281 too ..
Hope that helps you ...
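Not official, but the sleep-based init container AWS suggests would look roughly like the sketch below (the container name is a placeholder and 60s is simply the delay they quoted):
spec:
  initContainers:
    - name: wait-for-fargate   # hypothetical name
      image: busybox:1.28
      # Delay app startup past the ~60s window in which DNS reportedly fails
      command: ['sh', '-c', 'sleep 60']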
I've encountered the same behavior. Here is a temporary workaround until the EKS team fixes the problem.
Add the initContainer spec below to your deployment/pod manifest. It keeps the pod in the Init state until your desired FQDN resolves.
spec:
  initContainers:
    - name: init-service
      image: busybox:1.28
      command: ['sh', '-c', "until nslookup <FQDN-your-app-is-trying>; do echo waiting for DNS resolution; sleep 2; done"]
Seeing the same issues here with 1.29
Same here with Ubuntu 22.04 EKS AMIs and 1.29 🤔
It isn't only DNS. CoreDNS doesn't report any problem, while metrics-server fails to start because it cannot reach the API server endpoint: "https://10.100.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.100.0.1:443: i/o timeout
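A quick way to check the same thing from inside the cluster is a throwaway pod that curls the service IP from the log line above (10.100.0.1 here; substitute your cluster's kubernetes service IP). This is only a sketch; the pod name and image are arbitrary:
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk -m 5 https://10.100.0.1/healthz
# Any HTTP response (even 401/403) means the service IP is reachable;
# a timeout reproduces the i/o timeout that metrics-server reports.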
We also experienced this issue with 1.29. When there is high load, new servers join the cluster and we get these errors from the pods on the new servers. It's having a big impact on our production traffic.
production.ERROR: php_network_getaddresses: getaddrinfo for .....cache.amazonaws.com failed: Temporary failure in name resolution
This is a critical issue. They should prioritize this :-(
@sameeraksc EKS on EC2? or Fargate? Issue that is being referenced above is specific to EKS Fargate and should have nothing to do with high load.
It's Fargate on EKS @achevuru. Yeah, it has nothing to do with the high load itself. I mean that under high load, deployments in the cluster get auto-scaled, and the newly launched Fargate pods randomly have this issue.
FYI, I cannot reproduce it on Ubuntu 20.04 EKS with 1.29. I've opened https://bugs.launchpad.net/cloud-images/+bug/2060203 to track this on their side.
@ilpianista we are having this issue on Amazon Linux 2 EKS nodes as well.
Since the time I raised the issue, I have downgraded my cluster. I wanted to know whether the issue still persists or has been resolved.
I have just upgraded my setup:
My workloads come mainly from deploying Helm charts. On a cursory examination I don't see any DNS issue, which concerns me given the number of thumbs up and the amount of participation in this ticket.
Has a particular use case been identified that triggers the DNS issues, or does anyone know whether it has been resolved?
Thank you in advance
Same issue here with an AKS cluster after upgrading to 1.29.4, using a node pool with node image version AKSUbuntu-2204gen2containerd-202405.27.0. Funnily enough, it is only observed on one out of 12 AKS clusters. Any help would be appreciated.
Stopping and starting the affected cluster actually fixed it... /shrug
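For anyone who wants to try the same stop/start cycle on AKS, it is roughly this with the Azure CLI (cluster and resource group names are placeholders):
az aks stop  --name my-cluster --resource-group my-rg
az aks start --name my-cluster --resource-group my-rg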
Same issue here. Any update for this?
I found my issue to be due to the 1.29.x versions of the kube-proxy add-on. Once a coredns pod was removed from rotation, all the iptables rules were properly updated by kube-proxy but an entry in the conntrack table pointing to the old coredns pod's IP was left behind. After reverting kube-proxy to 1.28.8-eksbuild.5 the problems went away.
See this issue for more details. A fix was implemented here and backported to Kubernetes 1.29.9.
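For anyone hitting this conntrack variant, a rough way to confirm and clear the stale entries on an affected node (requires conntrack-tools; the IP below is a placeholder for the removed CoreDNS pod's IP):
# List UDP/53 conntrack entries whose replies still come from the old CoreDNS pod IP
sudo conntrack -L -p udp --dport 53 --reply-src 10.0.1.23

# Delete those stale entries so new DNS queries get NATed to a live CoreDNS pod
sudo conntrack -D -p udp --dport 53 --reply-src 10.0.1.23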
The EKS documentation lists v1.30.3-eksbuild.5 as the latest version of kube-proxy for 1.30, but this issue isn't fixed for 1.30 until 1.30.6. I see that v1.30.6-eksbuild.1 is available in ECR - should the documentation be updated to recommend this version instead? If not, what action should users affected by this issue take?
This is a fairly significant problem for us, with failures impacting several clusters of ours in noticeable ways. I'm surprised guidance has yet to be published for the issue.
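In the meantime, here is a rough sketch of how to check which kube-proxy add-on versions are actually published for 1.30 and switch to one with the AWS CLI (cluster name is a placeholder; the target version is the one mentioned above):
# List the kube-proxy add-on versions published for Kubernetes 1.30
aws eks describe-addon-versions --addon-name kube-proxy \
  --kubernetes-version 1.30 \
  --query 'addons[].addonVersions[].addonVersion' --output text

# Move the cluster's kube-proxy add-on to the fixed build
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.30.6-eksbuild.1 --resolve-conflicts PRESERVE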
We recently upgraded from version 1.25 to 1.29. Up until version 1.28, we didn't have any issues. However, after upgrading to version 1.29, our applications suddenly started throwing errors stating that they can't resolve domain names.
Upon inspecting the logs, we discovered that CoreDNS had encountered some issues. Somehow, with EKS version 1.29, it can't reach the Kubernetes endpoint. We are using Terraform to update the cluster.
Is there something that changed with EKS version 1.29 related to CoreDNS, CNI, or networking in general? Everything worked fine up to version 1.28. Apart from changing the version in Terraform, all configurations remain the same.
I have included the CoreDNS logs for both version 1.28 and 1.29. Any help would be greatly appreciated.
CoreDNS logs with EKS 1.28
CoreDNS logs with EKS 1.29
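Not an answer, but a few generic checks that may help narrow down whether CoreDNS can reach the API server after the upgrade (standard EKS labels assumed; names may differ in your cluster):
# CoreDNS pods and the nodes/IPs they landed on
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Recent CoreDNS logs (connection errors to the API server show up here)
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

# The API server endpoints CoreDNS is trying to reach
kubectl get endpoints kubernetes -n default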