Coredns is in running state but not functioning properly as expected.

panchm commented 1 year ago

We are observing multiple issues with the coredns on the below mentioned environment

Environment:

AWS Region: us-east-1
Instance Type(s): m5.4xlarge and m5.2xlarge
EKS Platform version: eks.5
Kubernetes version: v1.23.7
AMI Version: (ami-036b6ef0ea363fec1) ubuntu-eks/k8s_1.23/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20220901
Kernel : Linux ip-10-0-1-176 5.15.0-1019-aws #23~20.04.1-Ubuntu SMP Thu Aug 18 03:20:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Release information (run cat /etc/eks/release on a node): Not found
[Amazon VPC CNI] Enable pod networking within your cluster. Category: networking Status: Active Version: v1.10.4-eksbuild.1
[CoreDNS] Enable service discovery within your cluster. Category: networking Status: Active Version: v1.8.7-eksbuild.2 IAM Role Inherited from nodeEnable service discovery within your cluster.
[kube-proxy] Enable service networking within your cluster. Category: networking Status: Active Version: v1.23.7-eksbuild.1 IAM Role Inherited from node

1. After the node got rebooted and came up, the pods running on that node are not able to communicate as they failed to resolve service names.

2. Sometimes we don't see any errors on the coredns logs and still, pods are failed to communicate as they failed to resolve service names.

3. We observed one strange behavior i.e. sometimes application pods are running properly if the coredns are not running part of that same node.

We have tried to follow the below steps as well, but does not help much and it helps sometime with restarted the coredns pods. https://aws.amazon.com/premiumsupport/knowledge-center/eks-dns-failure/

Note: We took precautions with the coredns pods running on different nodes all the time.

Is this a known issue? We have not altered any configuration with respect to the Kube-system pods. Are we missing some other configuration required as part of the EKS nodegroup or any other configuration?

Sometimes we don't see any errors but the application pods failed to communicate

$ kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2

Sometimes we observe these issues which says some plugin errors

kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:39671->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:33958->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:42392->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:39358->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:60994->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:59687->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 kafka.kafka-nextgen-dev1.svc.cluster.local.ec2.internal. AAAA: read udp 10.0.173.102:34256->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:33075->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 kafka.kafka-nextgen-dev1.svc.cluster.local.ec2.internal. AAAA: read udp 10.0.173.102:37118->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:35190->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:53705->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:33519->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:52150->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. A: read udp 10.0.173.102:45225->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. AAAA: read udp 10.0.173.102:44717->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. AAAA: read udp 10.0.173.102:50949->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. A: read udp 10.0.173.102:52404->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:47961->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. AAAA: read udp 10.0.173.102:42074->10.0.0.2:53: i/o timeout

bryanasdev000 commented 1 year ago

@panchm for curiosity, the nodes where CoreDNS are that fully works, are outside us-east-1c?

I had a bunch of repeating issues the last week with "random" DNS issues at this AZ (and a few, rare ones at us-east-1b), although it can be a lot of things, seeing this issue make me think, in my case, for now, I just dropped us-east-1c from our workload.

Also, you can enable log to see if you can find any SERVFAIL? (If you have a log stack and a big env, prepare for avalanche :P) Or if you have a Prometheus setup, you can monitor CoreDNS and use these rules (https://github.com/povilasv/coredns-mixin) to check which pod are going havoc.

In my case any CoreDNS pod at us-east-1c will randomly fail with SERVFAIL, be our own entries at Route53, AWS Services (like sqs, sns and such) and others .com or com.br entries.

If I do ssh in affected nodes I also get the same behavior, randomly failing with SERVFAIL, so I think the root issue may not be CoreDNS, but I also did not find anyone else speaking of this problem and the last time that I have seen an issue like this, it was global (https://twitter.com/awssupport/status/1186862408893194241) or a bit more specific (https://news.ycombinator.com/item?id=31525267).

cartermckinnon commented 1 year ago

This repository tracks issues with the EKS-Optimized AMI based on Amazon Linux 2; the Ubuntu-based AMI is maintained by Canonical. Please engage AWS Support for this issue.

awslabs / amazon-eks-ami