@chweidling I will just let you know that you are not alone. My team and I have been battling this all day with no luck at all. I think @JiangtianLi is looking into it (or at least into similar issues). A quick search and look through the issues shows that there are multiple problems with Windows DNS and networking right now.
I face an issue which sounds similar. I'm on AzureCloudGermany. However, I have trouble with Linux-based (Ubuntu, Debian, Alpine) containers when it comes to DNS resolution, but only with a multi-agent cluster. With only one k8s agent node, this does not seem to be a problem. Should I open up a separate GitHub issue for that, as this one refers to Windows containers?
Hi,
we are facing the same issue described by @chweidling. We have a hybrid cluster with both Linux and Windows nodes, and only the Windows node suffers from this problem.
@ITler yes, it seems that your issue is different... maybe it is better to open a new issue ;)
I can confirm I'm seeing the exact same behavior: DNS doesn't work in Kubernetes containers (if I create a container on the node using docker directly, it works).
@ITler Is your multi-agent cluster Linux only or hybrid? If it is Linux only, please file a different issue.
Maybe this helps in diagnosing the issue: I was able to get the pods working by changing the DNS entry from the ClusterIP to one of the DNS pod IPs.
netsh interface ip show config
netsh interface ip set dns "****" static 10.244.0.3
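For reference, the kube-dns pod IPs used above can be listed from the master; one way to do it (assuming the stock kube-dns deployment, which carries the k8s-app=kube-dns label):
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide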
Nice catch @Josefczak! On our side, we also added the DNS suffix to let Windows containers resolve short service names, using these PowerShell commands:
$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
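To check from inside the container that these settings took effect, something like the following should work (my-svc is a placeholder for any existing service in the namespace, not a name from this thread):
Get-DnsClientServerAddress -InterfaceIndex (Get-NetAdapter).ifIndex
Resolve-DnsName my-svc   # the short name now resolves via the connection-specific suffix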
@Josefczak thanks!
I don't have much to add here. I just came across this after much searching. I have the same timeout issue connecting to 10.0.0.10 using nslookup. While setting the container's DNS is a solution, having to muck about with my container entrypoint to work around this issue doesn't seem like the greatest approach. Fortunately we are still in an early testing phase. Is there a bug to track somewhere for this specific issue?
@esheris I guess you are looking at it
Oh man, this solution @Josefczak found is what I've been looking for literally for 2 months. :-) I can get this to work if I manually connect to my pods, but am struggling with the Dockerfile commands to automate this. Can anyone offer nanoserver Dockerfile commands that work? (i.e. no PowerShell required!)
@brobichaud You can try netsh, e.g., https://technet.microsoft.com/pt-pt/library/cc731521(v=ws.10).aspx#BKMK_setdnsserver
Yeah, I did see that in the thread above, but the problem is that it requires the interface name, which appears to be unique to the pod. Surely someone has already automated this in a Dockerfile. This would be a HUGE fix for a longstanding DNS issue in 1709 for me.
The only solution I can come up with is to modify my container's entrypoint to be a PowerShell script that runs the above commands and then executes what I really want to run. In my case I ended up having some other things I needed to do with my web.config, so my Dockerfile now looks like this:
FROM microsoft/aspnet:4.7.1-windowsservercore-1709
COPY entrypoint.ps1 .
...
ENTRYPOINT [ "powershell.exe", "c:\\entrypoint.ps1"]
entrypoint.ps1 essentially looks like this:
$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
... web.config update ...
c:\ServiceMonitor.exe w3svc
@esheris @JiangtianLi I was able to come up with the PowerShell commands for servercore much like you have (though I put them inline in the Dockerfile), but when I deploy my pod the DNS server hasn't changed. I suspect a permissions problem in the Dockerfile. It's like it runs the commands but they fail to apply. I can still remote into my pod, manually issue the same commands, and then my already-running app suddenly starts working. Here is the relevant snippet from my Dockerfile:
SHELL ["powershell", "-command"]
RUN "$adapter=Get-NetAdapter; \
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3;"
Does anyone know how to correctly elevate permissions in the dockerfile for Windows?
You can't really do this in the Dockerfile directly, because the underlying NIC of the container will change, and you are setting the DNS based on it. This is why I had to modify my container entrypoint: you have to set DNS when the container starts.
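In Dockerfile terms, the distinction is RUN (executed at image build time, against a throwaway build container's NIC) versus ENTRYPOINT (executed at container start, against the NIC the pod actually gets). A minimal sketch:
# Build time: any DNS setting made here dies with the build container's NIC.
# RUN Set-DnsClientServerAddress ...
# Container start: the script sets DNS on the pod's real NIC.
ENTRYPOINT [ "powershell.exe", "c:\\entrypoint.ps1" ]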
Ahhh, I see. That does explain why it failed to work in my dockerfile. Ugh, your workaround is ingenious but so ugly and feels so hacky. Alas it DOES work, and I thank you @esheris!
A couple of questions maybe you can answer:
I certainly agree that it feels hacky; I expressed something similar in my original post. Sorry, my app is a .NET 4.7 app with some WebForms stuff in it, so we can't run nano server/.NET Core, and I'm not really sure how to answer your questions. I just grabbed the default entrypoint of one of my older images (docker inspect imageguid) and tacked it onto the bottom of my entrypoint script.
I just pulled microsoft/nanoserver:latest and launched it (docker run -it microsoft/nanoserver:latest powershell), and it seems Get-NetAdapter and Set-DnsClientServerAddress/Set-DnsClient are there.
Unfortunately nanoserver:latest is Server 1607, and I really need Server 1709 (yeah, weird decision on Microsoft's part). Server 1709 removed PowerShell support. :-( I'll continue iterating on it and post a response here if I come up with a solution for nanoserver 1709. Or I may resort to using the new PowerShell Core in nanoserver 1709.
You could assume the NIC name, which should always be the same, and set it with netsh that way:
netsh interface ip set dns "Ethernet" static <dnsip>
Run/exec into your container and validate its name first; "Ethernet" was what I had in my previously mentioned nanoserver container.
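To see the interface names from inside the container before hardcoding one, netsh itself can list them (available even where PowerShell is not):
netsh interface show interface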
I can see that if I open a new nanoserver container locally the name is always "Ethernet", but the interface name appears to be dynamic in an ACS k8s pod. For example, mine is now:
vEthernet (beb30eddfc08797307915783cb1c32039566d8f9ac7911334cbebd8dd0e366a2_l2bridge)
But to prove this can even work with netsh, I opened a command prompt in my pod and tried to do it manually; the result is:
The requested operation requires elevation (Run as administrator)
Do you know how to elevate a command prompt in a container/pod?
Argh. Roadblocked here with nanoserver. The elevation issue has prevented me from pursuing the netsh approach. I cannot find anything on how I can elevate to admin in a nanoserver command prompt.
So then I thought maybe I'd explore the PowerShell Core path with nanoserver, since I've got a script that works on servercore. Alas, PowerShell Core does not support Set-DnsClientServerAddress. I suspect that's because that cmdlet is very Windows-specific and Core is designed to be cross-platform.
Dead end. I can of course migrate my DotNet Core app to run on servercore, which I don't really want to do as it feels like a step backwards. And it means automating the install of DotNet Core, since there is no pre-built servercore image with DotNet Core.
I gotta say, nanoserver is easy to love and yet even easier to hate. :-(
Runas is the general command, I believe. Not sure what you would run it as, though. Perhaps try setting up the entrypoint script with the netsh commands in it; being launched from the main container process may give you the perms.
A good suggestion @esheris on the idea of running an entrypoint script. Alas, I tried it and see the same error about elevation being required. It feels like I am so close: I have discovered that the interface index is consistently 30, so if I had permissions I could use this command to set the DNS server:
netsh interface ip set dns 30 static 10.244.0.3
As for runas, it does not exist in nanoserver. Blocked by nanoserver at every path, it feels! I may have to step back and move my nanoserver use to servercore until Microsoft gets this fixed. Sooo not what I want to do; I really want to get some legs on nanoserver as we are building up this greenfield app, not migrate to it later and see what breaks all at once! :-(
This one is of interest #2230
In the meantime I have this as a workaround to find the current IP addresses of the kube-dns pods. I'm running servercore, so I can use Set-DnsClientServerAddress.
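The script itself wasn't captured in this thread; a minimal sketch of one way to do it from inside the container, assuming the pod has a service account mounted that is allowed to read endpoints in kube-system (the names below follow standard Kubernetes conventions and are not the original poster's code):
$apiHost = $env:KUBERNETES_SERVICE_HOST
$apiPort = $env:KUBERNETES_SERVICE_PORT
$token   = Get-Content 'C:\var\run\secrets\kubernetes.io\serviceaccount\token' -Raw
# Skip TLS validation for brevity; a real script should trust the cluster CA instead.
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = { $true }
# Ask the API server for the current kube-dns endpoint addresses.
$uri = "https://${apiHost}:${apiPort}/api/v1/namespaces/kube-system/endpoints/kube-dns"
$ep  = Invoke-RestMethod -Uri $uri -Headers @{ Authorization = "Bearer $token" }
# Flatten the endpoint subsets into the list of kube-dns pod IPs and apply them.
$dnsIps  = $ep.subsets | ForEach-Object { $_.addresses } | ForEach-Object { $_.ip }
$adapter = Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses $dnsIps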
Just so anyone who runs into the same issue knows: for me the workaround didn't work unless I added Start-Sleep 10 at the start (it will probably work with less).
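For illustration, the adjusted entrypoint then starts like this (a sketch; the 10-second value is what worked above, and the IPs are this thread's example kube-dns pod IPs):
Start-Sleep 10   # give the pod NIC time to come up before touching DNS settings
$adapter = Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3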
I looked through #2230 and it does look interesting, but it's not clear to me that it addresses this issue. Clearly there are other DNS issues in Windows 1709 itself, but I wonder whether the problem we are seeing is in fact Windows, or the way k8s is set up with Windows nodes?
This just feels like such a huge roadblocker of an issue that it should be of highest priority to get fixed.
Yeah, it is a huge blocker. I actually pulled down the pull request, merged in the latest changes from acs-engine master, and built it. Still no DNS resolution from 1709 containers...
Sorry for the inconvenience. We are going to roll out a Windows patch for acs-engine to mitigate the DNS issue ASAP. I will update here.
Thank you @JiangtianLi! Do you feel like you have a full grasp of what the DNS issues are? I ask because I've seen a lot of talk about a DNS issue where it works for some short period (15 mins?) and then stops working. The problem we are seeing here is that DNS blatantly does not work from the very start at all in Windows containers on 1709 nodes. I'm just trying to be thorough in making sure you guys are seeing and fixing/mitigating this very specific DNS issue as well.
@brobichaud The 15-minute DNS delay is one issue. Another issue was a regression in a January Windows update that affects service VIPs on Windows nodes and therefore kube-dns. So there will be two patches that fix the two issues separately.
@JiangtianLi, I am happy to hear you are fully on top of both issues. Thank you! And please do update this issue when we can utilize the fix. :-)
Thank you for the update @JiangtianLi
So this is a separate issue? When is this issue going to be fixed?
@JiangtianLi has there been any update? I am affected by this issue regardless of the apparent 15-minute delay.
@brobichaud - you can RDP to the Windows host your container is running on, and then docker exec -it -u Administrator <containername> cmd.exe
and set it there. Perhaps this can assist you in verifying the below against what I am seeing using 1709 containers on the 1709 host.
I did this, then executed the above command to set DNS statically for the interface:
netsh interface ip set dns "vEthernet (237be51c6e481e484b44557ead0d420912f83bd32b21a4038c2ec3ac23e81d21_l2bridge)" static 10.244.1.6
However, DNS still does not work from my Windows containers. Just to show I was using the correct pod IP:
adm@k8s-master-85975145-0:~$ kubectl get ep --namespace=kube-system kube-dns -o yaml | grep ip
- ip: 10.244.1.4
- ip: 10.244.1.6
adm@k8s-master-85975145-0:~$ kubectl get pod --namespace=kube-system kube-dns-v20-3003781527-ws75z -o wide
NAME READY STATUS RESTARTS AGE IP NODE
kube-dns-v20-3003781527-ws75z 3/3 Running 0 5h 10.244.1.6 k8s-linuxpool1-85975145-0
I can also reach the pod IP for that kube-dns:
C:\publish>ping 10.244.1.6
Pinging 10.244.1.6 with 32 bytes of data:
Reply from 10.244.1.6: bytes=32 time=2ms TTL=63
What's funny is what I've noticed: inside a pod, I can resolve the DNS names of some pods (regardless of whether they are on the same node or a different node) using the pod name, and it returns an IPv6 address. I can't use the container name, though, which works fine from Linux. What's even weirder is that while I can resolve some pods using the pod name, I can't even resolve kubernetes.default...
C:\publish>ping api-actionlogging-204518368-2s13q
Ping request could not find host api-actionlogging-204518368-2s13q. Please check the name and try again.
C:\publish>ping api-content-3588623890-6mw4s
Pinging api-content-3588623890-6mw4s [fe80::9b4:4a0b:4658:9905%20] with 32 bytes of data:
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms
Ping statistics for fe80::9b4:4a0b:4658:9905%20:
Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
Control-C
^C
C:\publish>ping api-logging-764837346-79wd9
Pinging api-logging-764837346-79wd9 [fe80::9525:ee0f:af92:2362%20] with 32 bytes of data:
Reply from fe80::9525:ee0f:af92:2362%20: time<1ms
Reply from fe80::9525:ee0f:af92:2362%20: time<1ms
Ping statistics for fe80::9525:ee0f:af92:2362%20:
Packets: Sent = 2, Received = 2, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
Control-C
Some further info: I've seen https://github.com/Microsoft/SDN/issues/150, but I'm unable to get DNS resolving with the FQDN in Windows containers either:
ping api-session.default.svc.cluster.local
Ping request could not find host api-session.default.svc.cluster.local. Please check the name and try again.
So, pretty much, some pod names resolve, but I can't resolve services, containers, or FQDNs. I'm still facing this issue without any workaround, using acs-engine to deploy a hybrid cluster (2 Windows, 1 Linux) with Windows host OS Server 1709, as per the latest acs-engine defaults.
BTW, sometimes the workaround doesn't work for me either, and this happens quite randomly.
I've deployed the cluster as hybrid multiple times and have never had DNS working. Just deployed a Windows-only cluster: same issues.
@matthewflannery to clarify, are you saying that you've deployed multiple hybrid clusters and see no DNS issues on the Windows nodes, but then on a Windows only cluster you do see DNS issues?
If that's so, I'd gladly deploy a hybrid cluster with one Linux node if I could move forward without hacks on the Windows side. Lemme know if what I'm hearing is right...
The Windows patches have been applied in acs-engine and merged into master HEAD. The service VIP issue was verified as fixed, and Windows pods can talk to kube-dns now. The caveat with the DNS config fix is that it also depends on a properly patched container image. Therefore it doesn't fully work in acs-engine currently, and you may see random "DNS not configured" errors in Windows containers until the Windows container image on Docker Hub is updated with the February Windows update.
@JiangtianLi that is definitely huge progress. And do you still plan to circle back and update this issue as soon as those updated containers are available?
@brobichaud Sure, will do.
@brobichaud To clarify,
I have deployed hybrid and Windows-only clusters. I have never had DNS working within Windows containers, regardless of the cluster configuration.
@JiangtianLi please circle back when the update is there. I assume we would need to rebuild clusters from scratch?
@4c74356b41 Rebuilding the cluster is not a must but is recommended. Otherwise, manually patching and updating each Windows node is required. Sorry for the inconvenience.
@JiangtianLi, is the patch included in windowsservercore/1709_KB4074588? I was able to get DNS working, but still have no outbound internet connection.
@croemmich Care to share your acs-engine JSON file? I just created a new cluster with acs-engine 0.13.0 and got "Failed to create sandbox" when creating a Windows container, with the same definition I've been using (except I was bold and updated to Kubernetes 1.9.3, from 1.9.1).
@msorby Unfortunately I'm not actually running on ACS, but I'm having the same networking issues, which seem to be related to Windows as a whole and not specifically to ACS.
@croemmich So it was the use of Kubernetes 1.9.3 that messed it up. I have now deployed a hybrid cluster with Kubernetes 1.9.1 using acs-engine 0.13.0, and I can successfully create a Windows container using microsoft/windowsservercore:1709_KB4074588 and do a wget from within that container to an external site without any hacks!
@JiangtianLi - can you please advise whether or not an updated wincni.exe is needed to fix this and if it will be published publicly at https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/cni?
@jbiel acs-engine already includes an updated wincni.exe built by Windows team.
@madhanrm do you have an answer on publishing wincni.exe to GitHub?
Is this a request for help?: NO
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.8.6
What happened:
I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kube-dns) time out. Requests to external DNS servers work.
Remark: This issue is somehow related to #558 and #1949. The related issues suggest that the DNS problems are connected to the Windows dnscache service or to the custom VNET feature. But the following description points in a different direction.
What you expected to happen: Requests to the internal DNS server should not time out.
Steps to reproduce:
Deploy a simple Kubernetes cluster with one master node and two Windows nodes with the following api model:
Then run a Windows container. I used the following command:
kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell
Then run the following nslookup session, where you try to resolve a DNS entry first with the default (internal) DNS server and then with Google's DNS server:
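The transcript itself wasn't captured here; the comparison amounts to the following two lookups from inside the pod (microsoft.com is an illustrative name, not necessarily the one originally used). The first query goes to the cluster DNS at 10.0.0.10 and times out; the second goes to Google's server and answers:
nslookup microsoft.com
nslookup microsoft.com 8.8.8.8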
Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.
I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod: