awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Coredns is in running state but not functioning properly as expected. #1150

Closed: panchm closed this issue 1 year ago

panchm commented 1 year ago

We are observing multiple issues with CoreDNS in the environment described below.

Environment:

1. After a node is rebooted and comes back up, the pods running on that node cannot communicate because they fail to resolve service names.

2. Sometimes we don't see any errors in the CoreDNS logs, yet pods still fail to communicate because they cannot resolve service names.

3. We observed one strange behavior: application pods sometimes run properly only when no CoreDNS pod is running on that same node.

We have also tried the steps in https://aws.amazon.com/premiumsupport/knowledge-center/eks-dns-failure/, but they do not help much; restarting the CoreDNS pods only helps occasionally.
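
For context, the basic in-cluster check we run looks roughly like this (the pod name and image are just an example):

$ kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
# A healthy cluster returns the cluster IP of the kubernetes service almost immediately;
# when the test pod is scheduled on an affected node, the lookup fails to resolve.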

Note: We have taken the precaution of always running the CoreDNS pods on different nodes.
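
For example, we confirm the spread with:

$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# The NODE column shows each CoreDNS replica running on a different node.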

Is this a known issue? We have not altered any configuration of the kube-system pods. Are we missing some other configuration required as part of the EKS node group, or any other configuration?

Sometimes we don't see any errors, yet the application pods fail to communicate:

$ kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2

Sometimes we observe plugin errors like the following:

$ kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.7
linux/amd64, go1.17.7, d433a3f2
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:39671->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:33958->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:42392->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:39358->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:60994->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:59687->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 kafka.kafka-nextgen-dev1.svc.cluster.local.ec2.internal. AAAA: read udp 10.0.173.102:34256->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:33075->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 kafka.kafka-nextgen-dev1.svc.cluster.local.ec2.internal. AAAA: read udp 10.0.173.102:37118->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:35190->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:53705->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:33519->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5155737311967868687.6477139177168497870. HINFO: read udp 10.0.173.102:52150->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. A: read udp 10.0.173.102:45225->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. AAAA: read udp 10.0.173.102:44717->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. AAAA: read udp 10.0.173.102:50949->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com.ec2.internal. A: read udp 10.0.173.102:52404->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. A: read udp 10.0.173.102:47961->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 sts.us-east-1.amazonaws.com. AAAA: read udp 10.0.173.102:42074->10.0.0.2:53: i/o timeout
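
All of the timeouts above are against 10.0.0.2, which is presumably the VPC (AmazonProvidedDNS) resolver that CoreDNS forwards to. A rough way to test that resolver directly from an affected node (assuming dig is installed there) is:

$ dig @10.0.0.2 sts.us-east-1.amazonaws.com +time=2 +tries=1
# If this also times out from the node itself, the problem is likely between the node and
# the VPC resolver rather than inside CoreDNS.
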
bryanasdev000 commented 1 year ago

@panchm out of curiosity, are the nodes where CoreDNS fully works outside us-east-1c?

I had a bunch of recurring "random" DNS issues in this AZ last week (and a few rare ones in us-east-1b). It could be a lot of things, but seeing this issue makes me wonder. In my case, for now, I just dropped us-east-1c from our workload.

Also, can you enable query logging to see if you find any SERVFAIL responses? (If you have a log stack and a big environment, prepare for an avalanche :P) Or, if you have a Prometheus setup, you can monitor CoreDNS and use these rules (https://github.com/povilasv/coredns-mixin) to check which pod is going havoc.
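
Enabling query logging is just a matter of adding the log plugin to the Corefile in the coredns ConfigMap, roughly like this (everything else in your Corefile stays as it is):

$ kubectl -n kube-system edit configmap coredns
.:53 {
    errors
    log          # log every query and response (including SERVFAIL) to stdout
    ...          # keep the existing plugins (kubernetes, forward, cache, etc.) unchanged
}
# CoreDNS picks the change up automatically if the reload plugin is enabled,
# which it is in the default EKS Corefile as far as I know.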

In my case, any CoreDNS pod in us-east-1c will randomly fail with SERVFAIL, whether for our own Route 53 entries, AWS services (like SQS, SNS, and such), or other .com and .com.br entries.

If I SSH into the affected nodes I get the same behavior, random SERVFAIL failures, so I think the root issue may not be CoreDNS. But I also did not find anyone else reporting this problem, and the last time I saw an issue like this it was either global (https://twitter.com/awssupport/status/1186862408893194241) or a bit more specific (https://news.ycombinator.com/item?id=31525267).
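
To reproduce it from a node, I just loop a query and grep the response status, something like:

$ for i in $(seq 1 100); do dig sqs.us-east-1.amazonaws.com +time=2 +tries=1 | grep -q 'status: SERVFAIL' && echo "SERVFAIL on attempt $i"; done
# A healthy node prints nothing; on an affected node a handful of attempts report SERVFAIL.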

cartermckinnon commented 1 year ago

This repository tracks issues with the EKS-Optimized AMI based on Amazon Linux 2; the Ubuntu-based AMI is maintained by Canonical. Please engage AWS Support for this issue.