Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine
MIT License

kube-dns pods keep crashing #3625

Closed jamoham closed 5 years ago

jamoham commented 6 years ago

Is this a request for help?: YES

Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE

What version of acs-engine?: v0.12.3

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.7.9

What happened: DNS issues in the cluster - kube-dns pods keep crashing and need to be explicitly restarted. Frequency of issue: 1-2 times per week

What you expected to happen: kube-dns pods should not crash

How to reproduce it (as minimally and precisely as possible): We are unable to determine what leads to kube-dns pods crashing

Details:

Cluster details:

We are currently seeing the following DNS issues in our cluster:

In our application logs: socket.gaierror: [Errno -3] Temporary failure in name resolution

In kube-dns logs: skydns: failure to forward request "read udp 30.0.0.238:52356->168.63.129.16:53: i/o timeout"

In dnsmasq logs: dnsmasq[21]: Maximum number of concurrent DNS queries reached (max: 150)

We wanted to seek your guidance on how to debug and fix this. Questions:

Question 1: Should we be setting auto-scaling for kube-dns?

kube-dns does not scale with the cluster - There are just 2 kube-dns pods in all our clusters. Should we be setting up DNS service auto-scaling?
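For context on Question 1: the Kubernetes DNS horizontal autoscaler (cluster-proportional-autoscaler) in linear mode derives the replica count from the number of nodes and cores. The sketch below reproduces that calculation in Python. The parameter values are the example values from the Kubernetes documentation and the 20-node, 80-core cluster is hypothetical; neither is taken from this cluster.

```python
import math


def linear_replicas(nodes, cores, nodes_per_replica=16, cores_per_replica=256,
                    min_replicas=1, max_replicas=None):
    """Replica count as computed by the autoscaler's linear mode.

    Defaults are illustrative assumptions, not settings from this cluster.
    """
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    # The larger of the two ratios wins, bounded below by the minimum.
    replicas = max(by_cores, by_nodes, min_replicas)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return replicas


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes with 80 cores in total.
    print(linear_replicas(nodes=20, cores=80))  # prints 2
```

With these example parameters the replica count only rises above 2 once the cluster exceeds 32 nodes, so autoscaling mainly helps larger clusters.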

Question 2: Are there any best practices on tuning dnsmasq?

Currently running with the following defaults:

Command:

/dnsmasq-nanny -v=2 -logtostderr -configDir=/kube-dns-config -restartDnsmasq=true -- -k --cache-size=1000 --no-resolv --server=127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10
Image being used:
k8s-dns-dnsmasq-nanny-amd64:1.14.5

Investigating more into dnsmasq:

We keep seeing the following log: dnsmasq[21]: Maximum number of concurrent DNS queries reached (max: 150)

150 concurrent DNS queries is a lot, considering that caching is set up. This means that over 150 unique non-cached domains are being requested concurrently, which does not seem right: from this application we only call a limited number of external domains.
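One possible contributor worth ruling out (an assumption, not confirmed for this cluster): the limit of 150 is dnsmasq's default for `--dns-forward-max`, and a pod's `/etc/resolv.conf` typically carries `options ndots:5` plus several search domains. A name with fewer than five dots is then tried against every search suffix before it is tried as-is, so one application lookup can become several forwarded queries. The Python sketch below lists the names a resolver would try; the search list and the hostname are hypothetical.

```python
def candidate_names(name, search, ndots=5):
    """Names a stub resolver tries, in order, for a single lookup.

    Follows the resolv.conf(5) rules: a name with a trailing dot is used
    as-is; a name with at least `ndots` dots is tried as-is first;
    otherwise the search suffixes are tried first.
    """
    if name.endswith("."):
        return [name]
    suffixed = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + suffixed
    return suffixed + [name]


if __name__ == "__main__":
    # Hypothetical search list for a pod in the "default" namespace.
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for candidate in candidate_names("api.example.com", search):
        print(candidate)
```

In this example a single lookup expands to four names, and resolvers commonly ask for both A and AAAA records for each. Writing external names with a trailing dot avoids the expansion.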

We need your pointers on investigating this further.

jackfrancis commented 6 years ago

@jamoham are you seeing the dnsmasq max concurrency errors exactly during the time that kube-dns is crashing?

jamoham commented 6 years ago

@jackfrancis We lost the logs from the crashed pod when we restarted it, so I cannot confirm that the errors occur exactly at the time of the crash, but they do occur in the same time range. I will definitely check this the next time the pod crashes.
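To make that comparison easier next time, the dnsmasq message can be counted per minute and lined up against the crash time. The Python sketch below assumes the logs were captured with `kubectl logs --timestamps`, so each line starts with an RFC 3339 timestamp, and that `kubectl logs --previous` was used to read the crashed container's output before the pod was deleted. The sample lines are made up for illustration.

```python
from collections import Counter

MARKER = "Maximum number of concurrent DNS queries reached"


def errors_per_minute(lines):
    """Count dnsmasq concurrency-limit messages per minute.

    Assumes each line begins with an RFC 3339 timestamp followed by a space,
    which is the layout produced by `kubectl logs --timestamps`.
    """
    counts = Counter()
    for line in lines:
        if MARKER in line:
            timestamp = line.split(" ", 1)[0]
            # The first 16 characters are "YYYY-MM-DDTHH:MM".
            counts[timestamp[:16]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the affected cluster.
    sample = [
        "2018-08-01T10:15:02.1Z dnsmasq[21]: " + MARKER + " (max: 150)",
        "2018-08-01T10:15:40.7Z dnsmasq[21]: " + MARKER + " (max: 150)",
        "2018-08-01T10:16:05.3Z dnsmasq[21]: using nameserver 127.0.0.1#10053",
    ]
    for minute, count in sorted(errors_per_minute(sample).items()):
        print(minute, count)
```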

tariq1890 commented 6 years ago

Hey @jamoham, were you able to reproduce this issue again? Any reason why you are not using the latest version of acs-engine?

jamoham commented 6 years ago

@tariq1890 We keep hitting this issue occasionally and we have to keep restarting the kube-dns pods.

We set this cluster up around 9 months ago, which is why it uses the older acs-engine version.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.