chweidling closed this issue 5 years ago.
Confirmed that 1803 nodes can't resolve using the service IP for kubedns at all.
On 1709 + the May 2018 update (KB4103727) we're also seeing more issues than usual with other service IPs in our cluster (not just kubedns.)
@sam-cogan 168.63.129.16 is the default Azure DNS. My guess is the pods are inheriting the DNS from the node. I am having the same issue with 1803.
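A quick way to confirm which DNS servers a pod actually picked up (the pod name is a placeholder):

kubectl exec -it <windows-pod> -- ipconfig /all
kubectl exec -it <windows-pod> -- powershell -Command "Get-DnsClientServerAddress -AddressFamily IPv4"

If 168.63.129.16 shows up instead of the kube-dns service IP (10.0.0.10 in these clusters), the pod has indeed inherited the node's Azure DNS settings.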
The DNS issue does not seem unique to Windows (#2999, #2880) as of a few days ago. I am not able to access external addresses from Windows or Linux pods with the latest release.
Update: I had made a mistake on the external calls, which are working, but DNS is not resolving internal names for Windows nodes, as @sam-cogan pointed out, unless I specify the DNS pod directly:
$ nslookup whoami 10.0.0.10
DNS request timed out.
    timeout was 2 seconds.

$ nslookup whoami.default.svc.cluster.local 10.0.0.10
DNS request timed out.

$ nslookup whoami.default.svc.cluster.local 10.240.0.22   (dns pod ip)
Server:   kube-dns-v20-59b4f7dc55-52kd5
Address:  10.240.0.22

Non-authoritative answer:
Name:     whoami.default.svc.cluster.local
Address:  10.0.136.90
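For reference, the DNS pod IP used in the last query can be found with something like this (assuming the stock kube-dns-v20 deployment and its usual k8s-app=kube-dns label):

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide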
acs-engine 0.18.8: this is still an issue. Some containers can reach kube-dns and some cannot. AKS has moved to GA and there is still no Windows container support. Does Kubernetes on Azure actually work? It has been an incredibly frustrating experience trying for the last 6 months to convince people this platform will work reliably.
@SteveCurran technically AKS has nothing to do with this (even though it's using ACS engine behind the scenes). But overall yeah. This is a trainwreck. And it's getting worse.
It does seem to be getting worse, and inconsistently so. I have a Windows init container that works fine calling out to the internet, yet the main container it spawns can't resolve anything; it makes no sense.
Any updates on this?
Since moving to acs-engine release 0.19 and using 1803 images, this seems to have improved significantly. DNS is resolving as expected and has stayed that way for some time.
I have moved to 0.19.3 and am still using 1709, and I am seeing much more stability. Still using the DNS addresses of the individual DNS pods. The DNS server seems to get overwhelmed when I try to push many deployments at once; if I push deployments slowly then all is well.
@SteveCurran how did you move to 0.19.3? I have a cluster that I would like to upgrade and I am already running Kubernetes 1.11.0, so the upgrade command is a no-op as there is nothing to upgrade. 0.19.2 seems to have some networking changes that I am hoping will solve my current issues, but I am unsure how to actually apply them. I could always drop and re-create a completely new cluster, but that is a lot of work, and the main thing I am unsure about is keeping the ingress LB IP. :-)
@ocdi I dropped and recreated.
I have the same problem. Cluster is unusable. Used acs-engine 0.19.1, windows server 1803, k8s version 1.11.0-rc.3
This is fixed in acs-engine 0.19.2. If you're still hitting it in that version or later, can you share details? Otherwise, can we close this issue?
@PatrickLang This may be a silly question, but how do we upgrade an existing cluster to 0.19.2? I can see how to upgrade if I am changing the kubernetes version, however I used 0.19.1 to upgrade to k8s 1.11.0 already and it is a no-op as I am already at the target version.
@PatrickLang does this require running 1803 on the host and in the container? When using a 1709 host and container, we still need to use the individual IP addresses of the DNS pods and not 10.0.0.10.
@PatrickLang doesn't work. acs-engine v0.19.3, k8s 1.11.0-rc.3, Server 1803. Windows pods can't access kube-dns.
@atomaras could you try with 0.19.5, 1.11.1, and Server 1803? I was able to successfully do DNS queries from Windows and Linux pods with those versions.
@jsturtevant Can I simply use acs-engine upgrade or do I have to recreate the cluster?
I usually drop and recreate to make sure everything is deployed properly.
@jsturtevant viable production approach
I upgraded an existing cluster from 1.11.0 to 1.11.1 with acs-engine 0.19.5 and 1803, and so far so good. The baseline CPU usage has dropped, which is nice, from a constant 15-20% to maybe 10%. Not sure what the previous version was using so much CPU on, but the more left over for containers to run, the better. I haven't observed any DNS issues so far, but it's only been an hour. :-)
@jsturtevant I recreated the cluster with k8s 1.11.1, and some Windows containers work but others don't. I don't know why. Specifically, I ran a windowsservercore:1803 busybox-style image in the default namespace and DNS worked. Then I ran my Windows Jenkins agent image, based on the dotnet framework 1803 image, inside the jenkins namespace and it didn't work (same as before).
Some extra observations: 1) the aci-networking container still gets scheduled on Windows nodes and fails, so I have to patch the deployment (see the sketch below), and 2) I initially tried upgrading the cluster, which resulted in only the master node moving to 1.11.1 while the other nodes remained at 1.11.0-rc.3, so I ended up recreating the cluster.
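For what it's worth, a sketch of the kind of patch I mean, assuming the component runs as a Deployment named aci-networking in kube-system (adjust the name/kind to whatever kubectl get deployments,daemonsets --all-namespaces actually shows):

kubectl patch deployment aci-networking -n kube-system --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"beta.kubernetes.io/os":"linux"}}}}}'

The nodeSelector pins it to Linux nodes so it stops landing on the Windows ones.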
@atomaras I believe the issue you're seeing is because the second pod is in a separate namespace. Could you exec into the pod in the jenkins namespace, run ipconfig /all, and post the output here? Can you connect to other pods if you use the fully qualified name? Additionally, what happens when you run the Jenkins deployment in the default namespace?
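Something along these lines (pod name is a placeholder, namespace as described above):

kubectl exec -it <jenkins-agent-pod> -n jenkins -- ipconfig /all
kubectl exec -it <jenkins-agent-pod> -n jenkins -- nslookup whoami.default.svc.cluster.local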
@ocdi Thanks for the update. If you see pods drop network connectivity/DNS over time, drop a note here.
Yes, there's a problem where only the pod's own namespace is added to the DNS suffix search list. https://github.com/kubernetes/kubernetes/issues/65016 mentions this as well. We need a specific fix in azure-cni, so I'm checking to make sure a tracking issue is filed there.
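In practice that means short names only resolve for services in the pod's own namespace; anything in another namespace needs the fully qualified name. For example, from a pod in the jenkins namespace (using the whoami service from the earlier comments):

nslookup whoami                              (fails - default.svc.cluster.local is missing from the suffix search list)
nslookup whoami.default.svc.cluster.local    (works - the FQDN doesn't rely on suffix search)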
I narrowed it down to being tied to a specific node. I have 2 Windows nodes. Node 31051k8s9001 works correctly:
but node 31051k8s9000 now fails with the following (this is the one that used to fail DNS):
which is most likely tied to the DNS issue.
Please note that those nodes have barely any containers running on them.
Here's the issue for the incomplete DNS suffix list: https://github.com/Azure/azure-container-networking/issues/206
@atomaras - the failure you highlighted above is due to IP address exhaustion "Failed to allocate address: … No available addresses"
The error isn't being handled correctly due to https://github.com/Azure/azure-container-networking/issues/195
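A quick sanity check of what is actually scheduled on the affected node (node name taken from the comment above; use grep instead of findstr outside of Windows):

kubectl describe node 31051k8s9000
kubectl get pods --all-namespaces -o wide | findstr 31051k8s9000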
Thank you @PatrickLang ! I'll be keeping an eye out for these.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.
Is this a request for help?: NO
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.8.6
What happened:
I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kubedns) time out. Requests to external DNS servers work.
Remark: This issue is somehow related to #558 and #1949. Those issues suggest that the DNS problems are related to the Windows dnscache service or to the custom VNET feature, but the following description points in a different direction.
What you expected to happen: Requests to the internal DNS server should not time out.
Steps to reproduce:
Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:
Then run a Windows container. I used the following command:
kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell
Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:
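For example (any external name shows the same pattern; www.google.com is just an example):

nslookup www.google.com             (default server should be the kube-dns service IP 10.0.0.10 - the request times out)
nslookup www.google.com 8.8.8.8     (same name against Google's DNS - answers immediately)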
Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.
I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod: