Closed (jamoham closed 5 years ago)
@jamoham are you seeing the dnsmasq max concurrency errors exactly during the time that kube-dns is crashing?
@jackfrancis We lost the logs from the crashed pod when we restarted it, so I cannot confirm that the errors occur exactly at the time of the crash, but they do occur in the same time range. I will check this the next time the pod crashes.
Hey @jamoham, were you able to reproduce this issue? Is there a reason you are not using the latest version of acs-engine?
@tariq1890 We still hit this issue occasionally, and each time we have to restart the kube-dns pods.
We set this cluster up around 9 months ago, which is why it uses the older acs-engine version.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.
Is this a request for help?: YES
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: v0.12.3
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.7.9
What happened: DNS issues in the cluster. kube-dns pods keep crashing and need to be explicitly restarted. Frequency of issue: 1-2 times per week.
What you expected to happen: kube-dns pods should not crash.
How to reproduce it (as minimally and precisely as possible): We are unable to determine what leads to the kube-dns pods crashing.
Details:
Cluster details:
We are currently seeing the following DNS issues in our cluster.
In our application logs:
socket.gaierror: [Errno -3] Temporary failure in name resolution
In kube-dns logs:
skydns: failure to forward request "read udp 30.0.0.238:52356->168.63.129.16:53: i/o timeout"
In dnsmasq logs:
dnsmasq[21]: Maximum number of concurrent DNS queries reached (max: 150)
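To line the application-side failures up against kube-dns restarts, something like the following probe could record each resolution attempt with a timestamp. This is only a sketch: the function name probe_resolution and its resolve/clock parameters are illustrative and not part of our application, and the default resolver simply calls socket.getaddrinfo.

```python
import socket
import time
from typing import Callable, Dict, Iterable, List


def _default_resolve(host: str) -> object:
    # Uses the system resolver, which inside a pod points at kube-dns.
    return socket.getaddrinfo(host, None)


def probe_resolution(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = _default_resolve,
    clock: Callable[[], float] = time.time,
) -> List[Dict[str, object]]:
    """Try to resolve each hostname once and record the outcome with a timestamp."""
    results = []
    for host in hostnames:
        started = clock()
        try:
            resolve(host)
            error = None
        except socket.gaierror as exc:
            # Errno -3 is the "Temporary failure in name resolution" case
            # seen in the application logs.
            error = f"{exc.errno}: {exc.strerror}"
        results.append({
            "host": host,
            "timestamp": started,
            "elapsed_s": clock() - started,
            "error": error,
        })
    return results
```

Calling probe_resolution with the handful of external domains the application uses, on a fixed interval, and logging the entries whose error is not None would give a failure timeline to compare with the pod restart times.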
We would like your guidance on how to debug and fix this. Questions:
Question 1: Should we be setting up auto-scaling for kube-dns? kube-dns does not scale with the cluster. There are just 2 kube-dns pods in all our clusters. Should we be setting up DNS service auto-scaling?
Question 2: Are there any best practices on tuning dnsmasq? Currently running with the following defaults: Command:
Investigating more into dnsmasq:
We keep seeing the following log:
dnsmasq[21]: Maximum number of concurrent DNS queries reached (max: 150)
150 concurrent DNS queries is a lot, considering that caching is set up. It would mean that over 150 unique non-cached domains are being requested concurrently, which does not seem right: from this application, we call only a limited number of external domains.
We would appreciate pointers on investigating this further.
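One way to check which names and clients are behind the load would be to tally the dnsmasq log. The sketch below counts the "Maximum number of concurrent DNS queries reached" warnings and, if dnsmasq is started with --log-queries, the names and client addresses in its query lines. The function name summarize_dnsmasq_log is illustrative, and the query line format in QUERY_RE is an assumption about what --log-queries prints; our current logs contain only the warning line.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the warning from our logs, e.g.
# "dnsmasq[21]: Maximum number of concurrent DNS queries reached (max: 150)"
LIMIT_RE = re.compile(
    r"Maximum number of concurrent DNS queries reached \(max: (\d+)\)"
)

# Assumed format of lines written when dnsmasq runs with --log-queries, e.g.
# "dnsmasq[21]: query[A] example.com from 10.0.0.5"
QUERY_RE = re.compile(r"query\[(\w+)\] (\S+) from (\S+)")


def summarize_dnsmasq_log(lines: Iterable[str]) -> Tuple[int, Counter, Counter]:
    """Return (limit warnings seen, queries per name, queries per client)."""
    limit_hits = 0
    names: Counter = Counter()
    clients: Counter = Counter()
    for line in lines:
        if LIMIT_RE.search(line):
            limit_hits += 1
            continue
        match = QUERY_RE.search(line)
        if match:
            names[match.group(2)] += 1
            clients[match.group(3)] += 1
    return limit_hits, names, clients
```

Feeding it the lines of the dnsmasq container log and printing names.most_common(20) and clients.most_common(20) should show whether a few domains or a single pod account for most of the queries.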