Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine
MIT License

The cluster-internal DNS server cannot be used from Windows containers #2027

Closed chweidling closed 5 years ago

chweidling commented 6 years ago

Is this a request for help?: NO


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm) kubernetes 1.8.6

What happened:

I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kube-dns) time out. Requests to external DNS servers work.

Remark: This issue is related to #558 and #1949. Those issues suggest that the DNS problems are connected to the Windows dnscache service or to the custom VNET feature, but the following description points in a different direction.

What you expected to happen: Requests to the internal DNS server should not time out.

Steps to reproduce:

Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "networkPolicy": "none"
      },
      "orchestratorRelease": "1.8"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "---",
      "vmSize": "Standard_D4s_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "backend",
        "count": 2,
        "osType": "Windows",
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "AvailabilitySet"
      }      
    ],
    "windowsProfile": {
      "adminUsername": "---",
      "adminPassword": "---"
    },
    "linuxProfile": {
      "adminUsername": "weidling",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa ---"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "---",
      "secret": "---"
    }
  }
}

Then run a Windows container. I used the following command: kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell

Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:

PS C:\> nslookup
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.0.0.10

> github.com
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds. 
(repeats 3 more times)
*** Request to UnKnown timed-out

> server 8.8.8.8
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [8.8.8.8]
Address:  8.8.8.8

> github.com
Server:  [8.8.8.8]
Address:  8.8.8.8

Non-authoritative answer:
Name:    github.com
Addresses:  192.30.253.113
          192.30.253.112

> exit

Anything else we need to know: According to #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.

I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod:

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : mycore-96fdd75dc-8g5kd
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (9519cc22abb5ef39c786c5fbdce98c6a23be5ff1dced650ed9e338509db1eb35_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-87-0F-CC
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::a58c:aaf:c12b:d82c%21(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.92(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

brunsgaard commented 6 years ago

@chweidling Just to let you know that you are not alone. My team and I have been battling this all day with no luck at all. I think @JiangtianLi is looking into it (or at least into similar issues). A quick search through the issues shows that there are multiple problems with Windows DNS and networking right now.

ITler commented 6 years ago

I face an issue which sounds similar. I'm on AzureCloudGermany. However, I have trouble with DNS resolution in Linux-based (Ubuntu, Debian, Alpine) containers, and only on multi-agent clusters. With only one k8s agent node, this does not seem to be a problem. Should I open a separate GitHub issue for that, since this one refers to Windows containers?

cpunella commented 6 years ago

Hi,

We are facing the same issue described by @chweidling. We have a hybrid cluster with both Linux and Windows nodes, and only the Windows nodes suffer from this problem.

@ITler yes, it seems that your issue is different... maybe it is better to open a new issue ;)

4c74356b41 commented 6 years ago

I can confirm I'm seeing the exact same behavior: DNS doesn't work in Kubernetes containers (if I create a container on the node using docker, it works).

JiangtianLi commented 6 years ago

@ITler Is your multi-agent cluster Linux only or hybrid? If it is Linux only, please file a different issue.

Josefczak commented 6 years ago

Maybe this helps in diagnosing the issue: I was able to get the pods working by changing the DNS entry from the ClusterIP to one of the DNS pod IPs.

netsh interface ip show config
netsh interface ip set dns "****" static 10.244.0.3

ghost commented 6 years ago

Nice catch @Josefczak! On our side, we also added the DNS suffix so that Windows containers can resolve short service names, using these PowerShell commands:

$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"

rbankole commented 6 years ago

Josefczak thanks!

esheris commented 6 years ago

I don't have much to add here. I just came across this after much searching. I have the same timeout issue connecting to 10.0.0.10 using nslookup. While setting the container's DNS works, having to muck about with my container entrypoint to work around this issue doesn't seem like a great solution. Fortunately we are still in an early testing phase. Is there a bug to track somewhere for this specific issue?

4c74356b41 commented 6 years ago

@esheris I guess you are looking at it

brobichaud commented 6 years ago

Oh man, this solution @Josefczak found is what I've been looking for literally for 2 months. :-) I can get this to work if I manually connect to my pods, but am struggling with the dockerfile commands to automate this. Can anyone offer nanoserver dockerfile commands that work? (ie: no powershell required!)

JiangtianLi commented 6 years ago

@brobichaud You can try netsh, e.g., https://technet.microsoft.com/pt-pt/library/cc731521(v=ws.10).aspx#BKMK_setdnsserver

brobichaud commented 6 years ago

Yeah I did see that in the thread above but the problem is that it requires the interface name, which appears to be unique to the pod. Surely someone has already automated this in a dockerfile. This is a HUGE fix for a longstanding DNS issue in 1709 for me.

esheris commented 6 years ago

The only solution I can come up with is to modify my container's entrypoint to be a PowerShell script that runs the above commands and then executes what I really want to run. In my case I also had some other things to do with my web.config, so my dockerfile now looks like this:

FROM microsoft/aspnet:4.7.1-windowsservercore-1709
COPY entrypoint.ps1 .
...
ENTRYPOINT [ "powershell.exe", "c:\\entrypoint.ps1"]

entrypoint.ps1 essentially looks like this


$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
... web.config update ...
c:\ServiceMonitor.exe w3svc

brobichaud commented 6 years ago

@esheris @JiangtianLi I was able to come up with the PowerShell commands for servercore much like you have (though I put them inline in the dockerfile) but when I deploy my pod the DNS server hasn't changed. I suspect a permissions problem in the dockerfile. It's like it runs the commands but they fail to apply. I can still remote into my pod and manually issue the same commands and then my already running app suddenly starts working. Here is the relevant snippet from my dockerfile:

SHELL ["powershell", "-command"]
RUN "$adapter=Get-NetAdapter; \
    Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3;"

Does anyone know how to correctly elevate permissions in the dockerfile for Windows?

esheris commented 6 years ago

You can't really do this in the dockerfile directly, because the underlying NIC of the container will change and you are setting the DNS based on it. This is why I had to modify my container entrypoint: you have to set DNS when the container starts.

brobichaud commented 6 years ago

Ahhh, I see. That does explain why it failed to work in my dockerfile. Ugh, your workaround is ingenious but so ugly and feels so hacky. Alas it DOES work, and I thank you @esheris!

A couple of questions maybe you can answer:

  1. I see you start your entrypoint with servicemonitor.exe. If I'm running a dotnet core app can I just run my executable or do I also need to somehow use servicemonitor? (it works just running my exe but I'm concerned I'm missing out on some pertinent feature by not using servicemonitor.exe)
  2. Have you had any luck with this same technique on nanoserver? I'm struggling to find the right commands to do this without PowerShell. The netsh example earlier in this thread requires the interface name, which is dynamic in the pod.

esheris commented 6 years ago

I certainly agree that it feels hacky; I expressed something similar in my original post. Sorry, my app is a .NET 4.7 app with some WebForms stuff in it, so we can't run nanoserver/.NET Core, and I'm not really sure how to answer your questions. I just took the default entrypoint of one of my older images (docker inspect imageguid) and tacked it on to the bottom of my entrypoint script.

I just pulled microsoft/nanoserver:latest and launched it (docker run -it microsoft/nanoserver:latest powershell), and it seems Get-NetAdapter and Set-DnsClientServerAddress/Set-DnsClient are there.

brobichaud commented 6 years ago

Unfortunately nanoserver:latest is Server 1607, and I really need Server 1709 (yeah, weird decision on Microsoft's part). Server 1709 removed PowerShell support. :-( I'll continue iterating on it and post a response here if I come up with a solution for nanoserver 1709. Or I may resort to using the new PowerShell Core in nanoserver 1709.

esheris commented 6 years ago

You could assume the nic name which should always be the same and set it with netsh that way

netsh interface ip set dns "Ethernet" static <dnsip>

run/exec into your container and validate its name first, "Ethernet" was what I had in my previously mentioned nanoserver container

brobichaud commented 6 years ago

I can see if I open a new nanoserver container locally the name is always "Ethernet" but the interface name appears to be dynamic in an ACS k8s pod. For example mine is now:

vEthernet (beb30eddfc08797307915783cb1c32039566d8f9ac7911334cbebd8dd0e366a2_l2bridge)

But to prove this even can work with netsh I opened a command prompt in my pod and tried to do it manually, the result is:

The requested operation requires elevation (Run as administrator)

Do you know how to elevate a command prompt in a container/pod?
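
Since the interface name differs per pod, one option is to read it from the `ipconfig /all` output instead of hard-coding it. The following is only a rough Python sketch of that idea; the helper names `adapter_names` and `netsh_set_dns` are made up for illustration, and the parsing assumes the English-language header format shown earlier in this thread ("Ethernet adapter <name>:"). It only builds the command string, so it does nothing about the elevation error above.

```python
import re
from typing import List

# Matches unindented header lines such as
#   "Ethernet adapter vEthernet (abc_l2bridge):"
# and captures the adapter name between "adapter " and the trailing colon.
ADAPTER_RE = re.compile(r"^\S.*? adapter (.+):\s*$", re.MULTILINE)


def adapter_names(ipconfig_output: str) -> List[str]:
    """Return the adapter names found in `ipconfig /all` output (hypothetical helper)."""
    return ADAPTER_RE.findall(ipconfig_output)


def netsh_set_dns(interface: str, dns_ip: str) -> str:
    """Build the netsh command used in this thread for one interface (hypothetical helper)."""
    return f'netsh interface ip set dns "{interface}" static {dns_ip}'
```

For example, `netsh_set_dns(adapter_names(output)[0], "10.244.0.3")` would produce a command line for the first adapter found, which could then be run from an entrypoint script.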

brobichaud commented 6 years ago

Argh. Roadblocked here with nanoserver. The elevation issue has prevented me from pursuing the netsh approach. I cannot find anything on how I can elevate to admin in a nanoserver command prompt.

So then I thought maybe I'd explore the PowerShell Core path with nanoserver since I've got a script that works on servercore. Alas PowerShell Core does not support Set-DnsClientServerAddress. I suspect because that cmdlet is very Windows specific and Core is designed as x-plat.

Dead-end. I can of course migrate my DotNet Core app to run on servercore, which I don't really want as it feels like a step backwards. And it means automating the install of DotNet Core since there is no pre-built servercore image with DotNet Core.

I gotta say, nanoserver is easy to love and yet even easier to hate. :-(

esheris commented 6 years ago

Runas is the general command I believe. Not sure what you would run it as though. Perhaps try setting up the entry point script with the netsh commands in it, perhaps being launched out of the main container process will give you perms


brobichaud commented 6 years ago

A good suggestion @esheris on the idea of running an entrypoint script. Alas I tried and see the same error about elevation being required. Feels like I am so close as I have discovered that the interface index is consistently 30, so if I had permissions I could use this command to set the DNS server:

netsh interface ip set dns 30 static 10.244.0.3

As for runas, it does not exist in nanoserver. Blocked by nanoserver at every path it feels! I may have to step back and move my nanoserver use to servercore until Microsoft gets this fixed. Sooo not what I want to do, I really want to get some legs on nanoserver as we are building up this greenfield app, not migrate it later and see what breaks all at once! :-(

msorby commented 6 years ago

This one is of interest #2230

In the meantime I have a workaround to find the current IP addresses of the kube-dns pods. I'm running servercore so I can use Set-DnsClientServerAddress.

4c74356b41 commented 6 years ago

Just in case anyone runs into the same issue: for me the workaround didn't work unless I added Start-Sleep 10 (it will probably work with less).
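
A fixed sleep can be replaced by polling until a lookup succeeds. The thread does not say why the delay helps, so this is only a sketch under the assumption that DNS becomes usable shortly after container start. It is written in Python rather than PowerShell, and `wait_for_dns` with its parameters is a hypothetical helper, not part of any tool mentioned here.

```python
import socket
import time


def wait_for_dns(hostname, attempts=10, delay=1.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Retry a DNS lookup until it succeeds or `attempts` is exhausted.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without a real network. Returns the resolved address; re-raises the
    last lookup error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt == attempts:
                raise
            sleep(delay)
```

The same loop shape could be ported to an entrypoint script: apply the DNS settings, then poll a known name before starting the main process.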

brobichaud commented 6 years ago

I looked through #2230 and it does look interesting, but it's not clear to me that it addresses this issue. Clearly there are other DNS issues in Windows 1709 itself, but I wonder if the problem we are seeing is in fact Windows or the way k8s is set up with Windows nodes?

This just feels like such a huge roadblocker of an issue that it should be of highest priority to get fixed.

msorby commented 6 years ago

Yeah it is a huge blocker. I actually pulled down the pull request and merged in the latest changes from acs-engine\master and built it. Still no DNS resolution from 1709 containers...

JiangtianLi commented 6 years ago

Sorry for the inconvenience. We are going to roll out a Windows patch for acs-engine to mitigate the DNS issue ASAP. I will update here.

brobichaud commented 6 years ago

Thank you @JiangtianLi! Do you feel like you have a full grasp of what the DNS issues are? I ask because I've seen a lot of talk about a DNS issue where it works for some short period (15 mins?) and then stops working. The problem we are seeing here is that DNS blatantly does not work from the very start at all in Windows containers on 1709 nodes. I'm just trying to be thorough in making sure you guys are seeing and fixing/mitigating this very specific DNS issue as well.

JiangtianLi commented 6 years ago

@brobichaud The 15-minute delay in DNS is one issue. Another issue was a regression in the January Windows update that affects the service VIP on Windows nodes and therefore kube-dns. So there will be two patches that fix the two issues separately.

brobichaud commented 6 years ago

@JiangtianLi, I am happy to hear you are fully on top of both issues. Thank you! And please do update this issue when we can utilize the fix. :-)

msorby commented 6 years ago

Thank you for the update @JiangtianLi

4c74356b41 commented 6 years ago

So this is a separate issue? When is this issue going to be fixed?

matthewflannery commented 6 years ago

@JiangtianLi has there been any update? I am affected by this issue regardless of this apparent 15-minute delay.

@brobichaud - you can RDP to the Windows host your container is running on, and then docker exec -it -u Administrator <containername> cmd.exe and set it - perhaps this can assist you in verifying the below with what I am seeing using 1709 containers on the 1709 host.

I did this, then executed the above command to set DNS statically for the interface:

netsh interface ip set dns "vEthernet (237be51c6e481e484b44557ead0d420912f83bd32b21a4038c2ec3ac23e81d21_l2bridge)" static 10.244.1.6

However DNS still does not work from my Windows containers. Just to show I was using the correct pod IP..

adm@k8s-master-85975145-0:~$ kubectl get ep --namespace=kube-system kube-dns -o yaml | grep ip
  - ip: 10.244.1.4
  - ip: 10.244.1.6
adm@k8s-master-85975145-0:~$ kubectl get pod --namespace=kube-system kube-dns-v20-3003781527-ws75z -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP           NODE
kube-dns-v20-3003781527-ws75z   3/3       Running   0          5h        10.244.1.6   k8s-linuxpool1-85975145-0
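
The endpoint IPs in the grep output above can also be extracted programmatically, for example to feed them into a DNS-setting command. This is a minimal Python sketch, assuming input lines of the form `- ip: <IPv4 address>` as printed above; `endpoint_ips` is a hypothetical helper name.

```python
import re
from typing import List

# Matches lines like "  - ip: 10.244.1.4" and captures the IPv4 address.
IP_LINE_RE = re.compile(r"^\s*-\s*ip:\s*(\d{1,3}(?:\.\d{1,3}){3})\s*$", re.MULTILINE)


def endpoint_ips(grep_output: str) -> List[str]:
    """Return the IPv4 addresses from `kubectl get ep ... -o yaml | grep ip` output."""
    return IP_LINE_RE.findall(grep_output)
```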

I can also reach the pod IP for that kube-dns:

C:\publish>ping 10.244.1.6
Pinging 10.244.1.6 with 32 bytes of data:
Reply from 10.244.1.6: bytes=32 time=2ms TTL=63

Something funny I've noticed: inside a pod, I can resolve the DNS names of some pods, regardless of whether they are on the same node or a different node, using the pod name, and it returns an IPv6 address. I can't use the container name though, which seems to work fine from Linux. What's even weirder is that while I can resolve some pods using the pod name, I can't even resolve kubernetes.default...

C:\publish>ping api-actionlogging-204518368-2s13q
Ping request could not find host api-actionlogging-204518368-2s13q. Please check the name and try again.
C:\publish>ping api-content-3588623890-6mw4s

Pinging api-content-3588623890-6mw4s [fe80::9b4:4a0b:4658:9905%20] with 32 bytes of data:
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms
Reply from fe80::9b4:4a0b:4658:9905%20: time<1ms

Ping statistics for fe80::9b4:4a0b:4658:9905%20:
    Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms
Control-C
^C
C:\publish>ping api-logging-764837346-79wd9

Pinging api-logging-764837346-79wd9 [fe80::9525:ee0f:af92:2362%20] with 32 bytes of data:
Reply from fe80::9525:ee0f:af92:2362%20: time<1ms
Reply from fe80::9525:ee0f:af92:2362%20: time<1ms

Ping statistics for fe80::9525:ee0f:af92:2362%20:
    Packets: Sent = 2, Received = 2, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms
Control-C

Some further info: I've seen https://github.com/Microsoft/SDN/issues/150, but I'm unable to get DNS resolving with the FQDN on Windows containers.

ping api-session.default.svc.cluster.local
Ping request could not find host api-session.default.svc.cluster.local. Please check the name and try again.

So, pretty much, some pod names resolve, but I can't resolve services, containers, or FQDNs. I'm still facing this issue without any workaround, using acs-engine to deploy hybrid cluster (2 win, 1 nix) with windows host OS: Server 1709 as per the default latest acs-engine.

4c74356b41 commented 6 years ago

Btw, sometimes the workaround doesn't work for me either, and this happens quite randomly.

matthewflannery commented 6 years ago

I've deployed a hybrid cluster multiple times and have never had DNS working. I just deployed a Windows-only cluster and see the same issues.

brobichaud commented 6 years ago

@matthewflannery to clarify, are you saying that you've deployed multiple hybrid clusters and see no DNS issues on the Windows nodes, but then on a Windows only cluster you do see DNS issues?

If that's so, I'd gladly deploy a hybrid with 1 Linux node if I could move forward without a hack on the Windows side. Lemme know if what I'm hearing is right...

JiangtianLi commented 6 years ago

The Windows packages have been applied in acs-engine and merged into master HEAD. The service VIP issue was verified to be fixed, and Windows pods can talk to kube-dns now. The caveat with the DNS config fix is that it depends on the proper container image, which also needs to be patched. Therefore the DNS config fix doesn't work in acs-engine currently, and Windows containers may randomly hit "DNS not configured" errors until the Windows container image on Docker Hub is updated with the February Windows update.

brobichaud commented 6 years ago

@JiangtianLi that is definitely huge progress. Do you still plan to circle back and update this issue as soon as those updated containers are available?

JiangtianLi commented 6 years ago

@brobichaud Sure, will do.

matthewflannery commented 6 years ago

@brobichaud To clarify,

I have deployed hybrid and Windows-only clusters. I have never had DNS working within Windows containers, regardless of the cluster configuration.

4c74356b41 commented 6 years ago

@JiangtianLi please circle back when the update is there. I assume we would need to rebuild clusters from scratch?

JiangtianLi commented 6 years ago

@4c74356b41 Rebuilding the cluster is not a must but is recommended. Otherwise, manually patching and updating each Windows node is required. Sorry for the inconvenience.

croemmich commented 6 years ago

@JiangtianLi, is the patch included in windowsservercore/1709_KB4074588? I was able to get DNS working, but still have no outbound internet connection.

msorby commented 6 years ago

@croemmich Care to share your acs-engine json file? I just created a new cluster with acs-engine 0.13.0 and got "Failed to create sandbox" when creating a Windows container. It's the same definition I've been using (except I was bold and updated to Kubernetes 1.9.3, from 1.9.1).

croemmich commented 6 years ago

@msorby Unfortunately I'm not actually running on ACS but having the same networking issues which seem to be related to Windows as a whole and not specifically ACS.

msorby commented 6 years ago

@croemmich So it was the use of Kubernetes 1.9.3 that messed it up. I have now deployed a hybrid cluster with Kubernetes 1.9.1 using acs-engine 0.13.0, and I can successfully create a Windows container using microsoft/windowsservercore:1709_KB4074588 and do a wget from within that container to an external site without any hacks!

jbiel commented 6 years ago

@JiangtianLi - can you please advise whether or not an updated wincni.exe is needed to fix this and if it will be published publicly at https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/cni?

JiangtianLi commented 6 years ago

@jbiel acs-engine already includes an updated wincni.exe built by Windows team.

@madhanrm do you have an answer about publishing wincni.exe on GitHub?