Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine

The cluster-internal DNS server cannot be used from Windows containers #2027

Closed chweidling closed 5 years ago

chweidling commented 6 years ago

Is this a request for help?: NO


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm) kubernetes 1.8.6

What happened:

I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kubedns) time out. Requests to external DNS servers work.

Remark: This issue is related to #558 and #1949. Those issues suggest that the DNS problems are connected to the Windows dnscache service or to the custom VNET feature, but the following description points in a different direction.

What you expected to happen: Requests to the internal DNS server should not time out.

Steps to reproduce:

Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "networkPolicy": "none"
      },
      "orchestratorRelease": "1.8"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "---",
      "vmSize": "Standard_D4s_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "backend",
        "count": 2,
        "osType": "Windows",
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "AvailabilitySet"
      }      
    ],
    "windowsProfile": {
      "adminUsername": "---",
      "adminPassword": "---"
    },
    "linuxProfile": {
      "adminUsername": "weidling",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa ---"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "---",
      "secret": "---"
    }
  }
}

Then run a Windows container. I used the following command: kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell

Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:

PS C:\> nslookup
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.0.0.10

> github.com
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds. 
(repeats 3 more times)
*** Request to UnKnown timed-out

> server 8.8.8.8
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [8.8.8.8]
Address:  8.8.8.8

> github.com
Server:  [8.8.8.8]
Address:  8.8.8.8

Non-authoritative answer:
Name:    github.com
Addresses:  192.30.253.113
          192.30.253.112

> exit

Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.

I observed the behavior independent from the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod:

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : mycore-96fdd75dc-8g5kd
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (9519cc22abb5ef39c786c5fbdce98c6a23be5ff1dced650ed9e338509db1eb35_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-87-0F-CC
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::a58c:aaf:c12b:d82c%21(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.92(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled
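
The nslookup comparison above can also be scripted so that it is easy to repeat on every pod. This is only a minimal sketch: it assumes that `Resolve-DnsName` is available inside the container image, and the helper name `Test-DnsServer` and its `-Resolver` parameter are hypothetical, introduced here for illustration.

```powershell
# Hypothetical helper: returns $true if $Server answers an A-record query for $Name.
function Test-DnsServer {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [string] $Server,
        # Injectable resolver so the logic can be exercised without a network.
        # The default queries the given server with Resolve-DnsName.
        [scriptblock] $Resolver = {
            param($n, $s)
            Resolve-DnsName -Name $n -Server $s -Type A -DnsOnly -ErrorAction Stop
        }
    )
    try {
        $answer = & $Resolver $Name $Server
        # Any non-empty answer counts as success.
        return [bool]$answer
    } catch {
        # Timeouts and resolution errors both count as failure.
        return $false
    }
}
```

Running `Test-DnsServer -Name github.com -Server 10.0.0.10` and then the same call with `-Server 8.8.8.8` would reproduce the session above: the first call is expected to return False on an affected pod, the second True.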
4c74356b41 commented 6 years ago

@j03wang I can confirm metrics-server bug. it targets windows nodes. @JiangtianLi had this issue multiple times

JiangtianLi commented 6 years ago

@4c74356b41 Please file an issue about metrics-server bug. Thanks.

madhanrm commented 6 years ago

Anyone having DNS or service VIP access issues, please collect the logs and pass them to us, so that we can evaluate whether any new issues are being hit.

Copy the folder https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/debug and execute "powershell collectlogs.ps1"

pushkar-bitwise commented 6 years ago

Hi @madhanrm please find link to log file https://pushkardebugdata.blob.core.windows.net/debugdata/archive.zip

madhanrm commented 6 years ago

@pushkar-bitwise In your setup, I don't see any policy lists configured by kubeproxy for the remote services (like DNS). Only 2 policy lists exist, and they point to the services deployed locally. So DNS won't work until the policy lists are programmed properly.

This is a known issue, where remote endpoint creation fails and policies cannot be created for remote services (like DNS). You can verify this by running the command below in PowerShell; you will not see any endpoints with the IsRemoteEndpoint field set to true.

get-hnsendpoint | select IsRemoteEndpoint, Id, IPAddress

In an ideal world, you would see several endpoints with IsRemoteEndpoint set to true.

A Windows update was pushed for this issue. @JiangtianLi can you point @pushkar-bitwise to how to get the update on the Azure VM?
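
The check above can be condensed into a count of local versus remote endpoints. This is a rough sketch, assuming the objects returned by `Get-HnsEndpoint` expose `IsRemoteEndpoint` as shown in the `select` above; the helper name `Get-RemoteEndpointSummary` is hypothetical.

```powershell
# Hypothetical helper: splits HNS endpoint objects into local and remote counts.
function Get-RemoteEndpointSummary {
    param(
        # Objects with an IsRemoteEndpoint property, e.g. the output of Get-HnsEndpoint.
        [object[]] $Endpoints = @()
    )
    # Local endpoints leave IsRemoteEndpoint empty, so only an explicit $true counts as remote.
    $remote = @($Endpoints | Where-Object { $_.IsRemoteEndpoint -eq $true })
    [pscustomobject]@{
        Total  = $Endpoints.Count
        Remote = $remote.Count
        Local  = $Endpoints.Count - $remote.Count
    }
}
```

On a node, `Get-RemoteEndpointSummary -Endpoints (Get-HnsEndpoint)` would report `Remote = 0` when the node is hitting the remote endpoint creation failure described above.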

madhanrm commented 6 years ago

@JiangtianLi is there a separate issue for the remote endpoint creation? If not can we create one and add steps on how to recover from that.

The current github issue page has multiple issues and it would confuse folks looking into this thread.

pushkar-bitwise commented 6 years ago

Thanks @madhanrm for your help. @JiangtianLi, please advise how to get the update on the Azure VM. Also, is there any way to create the remote endpoints manually?

pushkar-bitwise commented 6 years ago

@madhanrm I just executed the command you suggested. The details are below:

PS C:\data> Get-HnsEndpoint | Select IsRemoteEndpoint, Id, IpAddress

IsRemoteEndpoint ID                                   IPAddress
---------------- --                                   ---------
                 6ea68d9b-d6ca-41e3-bcd4-d4958b9bf4b0 10.244.3.220
                 0b2595e8-c833-4486-bf4b-0d7f69aaefb7 10.244.3.99
                 a84ebd45-205f-4025-991e-0b1108167054 10.244.3.41
                 532684f7-a5a3-48d6-96e6-4c372ab48d8b 10.244.3.176
                 0ea66390-b285-4648-a619-1b8dc24c2b5e 10.244.3.94
                 90d73037-2cad-4016-be01-c0818dbdb4a4 10.244.3.123
                 99921570-d03c-4815-9647-55c959fa9326 10.244.3.208
                 ff167245-e00f-401c-bb57-4e34f65488df 10.244.3.136
                 99e090f0-ad6e-42c6-8d02-6ecfa49abf10 10.244.3.224
                 fe9731cd-2ab3-4db1-a05c-a188250c042b 10.244.3.109
                 3c5dc6f6-c0e4-427b-90b2-1febe7d49fcb 10.244.3.18
                 6c871850-d2f1-4265-ace4-24b93169ef36 10.244.3.55
                 abe63d7f-3386-4038-82c1-716f3a7cd76f 10.244.3.139
                 afcdc35e-77f0-48a4-9ad5-77ef970dc226 10.244.3.174
                 ecbea8f5-e645-4c3b-9546-7acd3555b645 10.244.3.131
                 9ba0b809-0621-4281-8ec9-98222a0fa762 10.244.3.52
                 f4171599-85df-48a5-8c16-90bb054a8b29 10.244.3.115
                 3fe3ae7a-303a-4a7c-bf60-fda182b345c9 10.244.3.249
                 581c715c-4cf7-4ebd-aed9-b30bb17cc8ab 10.244.3.119
True             d453a703-14aa-4c99-9ddf-0a6b23d5f0d2 10.244.2.8

Logs from Container

PS C:\inetpub\wwwroot> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut

ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 53
InterfaceAlias         : vEthernet (1b354b9efc9ec2232fbb543d7de51c1b6d98eb3ecf70ed62a5df6c519337c021_l2bridge)
SourceAddress          : 10.244.3.41
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

PS C:\inetpub\wwwroot> ping 10.244.2.8

Pinging 10.244.2.8 with 32 bytes of data:
Reply from 10.244.2.8: bytes=32 time=1ms TTL=63
Reply from 10.244.2.8: bytes=32 time=1ms TTL=63
Reply from 10.244.2.8: bytes=32 time<1ms TTL=63

Ping statistics for 10.244.2.8:
    Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 1ms, Average = 0ms
Control-C
PS C:\inetpub\wwwroot> Test-NetConnection 10.244.2.8  -Port 53

ComputerName     : 10.244.2.8
RemoteAddress    : 10.244.2.8
RemotePort       : 53
InterfaceAlias   : vEthernet (1b354b9efc9ec2232fbb543d7de51c1b6d98eb3ecf70ed62a5df6c519337c021_l2bridge)
SourceAddress    : 10.244.3.41
TcpTestSucceeded : True

PS C:\inetpub\wwwroot>
JiangtianLi commented 6 years ago

@pushkar-bitwise Can you use Get-HotFix on your windows node to see if you have KB4089848? If not, can you run windows update and manually install that KB? I am tracking to see when Azure VM will have that update.
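
For anyone repeating this check on several nodes, the `Get-HotFix` lookup can be wrapped in a small function. This is a sketch only: the helper name `Test-HotFixInstalled` is hypothetical, and it assumes the hotfix objects carry a `HotFixID` property, as `Get-HotFix` output does.

```powershell
# Hypothetical helper: reports whether a given KB appears in a list of hotfix objects.
function Test-HotFixInstalled {
    param(
        [Parameter(Mandatory)] [string] $Id,
        # Objects with a HotFixID property; defaults to what Get-HotFix reports on this node.
        [object[]] $HotFixes = (Get-HotFix)
    )
    # -eq on strings is case-insensitive, so 'kb4089848' also matches.
    return [bool]($HotFixes | Where-Object { $_.HotFixID -eq $Id })
}
```

`Test-HotFixInstalled -Id KB4089848` returns True when the update is listed on the node and False otherwise.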

pushkar-bitwise commented 6 years ago

I can see the details below. Can you please guide me on how to install it manually?

PS C:\data> Get-HotFix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
24560K8S9010  Hotfix           KB999999      NT AUTHORITY\SYSTEM  3/22/2018 12:00:00 AM
24560K8S9010  Security Update  KB4056892                          1/8/2018 12:00:00 AM
pushkar-bitwise commented 6 years ago

@JiangtianLi is the command below OK for installing it manually?

 Invoke-WebRequest http://download.windowsupdate.com/d/msdownload/update/software/updt/2018/03/windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu -Out windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu

wusa.exe C:\data\windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu /quiet /norestart
pushkar-bitwise commented 6 years ago

@JiangtianLi I installed the KB, but internal DNS is still not working.

madhanrm commented 6 years ago

@pushkar-bitwise Can you collect the traces again and pass it to us?

pushkar-bitwise commented 6 years ago

Thanks @madhanrm and @JiangtianLi, the DNS issue is now resolved. I had forgotten to restart the server; after the restart everything is working properly.

Thanks for your help.

roycornelissen commented 6 years ago

@madhanrm I've captured traces on both of my Windows Nodes, you can find them here:

https://kditridentstorage.blob.core.windows.net/assessments-public/Archive.zip

Is there already a separate issue for the remote endpoints? I realise this issue is being mixed with both internal and external DNS problems. The external DNS issues keep coming back for me.

@pushkar-bitwise did you just install the kb4089848 hotfix and reboot the nodes or did you do anything else?

pushkar-bitwise commented 6 years ago

@roycornelissen I just installed kb4089848 using the commands below and rebooted the VM

 Invoke-WebRequest http://download.windowsupdate.com/d/msdownload/update/software/updt/2018/03/windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu -Out windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu

wusa.exe C:\data\windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu /quiet /norestart
winterTTr commented 6 years ago

Thanks to help from @JiangtianLi, my Windows node is now working. Here is what I did, in the hope that it helps someone. I am using Kubernetes 1.9 and the latest acs-engine.

  1. Connect remotely to the Windows node and install KB4089848 on it:

    Start-BitsTransfer http://download.windowsupdate.com/d/msdownload/update/software/updt/2018/03/windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu
    wusa windows10.0-kb4089848-x64_db7c5aad31c520c6983a937c3d53170e84372b11.msu

    This requires restarting your VM.

  2. Create any pod and deploy it to the Windows nodes with a DaemonSet, which ensures that at least one pod is running on each Windows node:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: network-daemon
      namespace: kube-system
      labels:
        k8s-app: network-daemon
    spec:
      selector:
        matchLabels:
          name: network-daemon-app
      template:
        metadata:
          labels:
            name: network-daemon-app
        spec:
          containers:
          - name: network-daemon
            image: your-simple-daemon-image-url
          nodeSelector:
            beta.kubernetes.io/os: windows
  3. Reset hns network settings

    Start-BitsTransfer -Source https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/windows/hns.psm1 -Destination C:\hns.psm1
    Import-Module -name C:\hns.psm1 -Verbose
    Stop-Service kubeproxy
    Stop-Service kubelet
    Get-HNSNetwork | ? Name -eq l2Bridge | Remove-HnsNetwork
    Get-HnsPolicyList | Remove-HnsPolicyList
    Start-Service kubelet
    Start-Service kubeproxy
mh101010 commented 6 years ago

I'm having the same problem with Azure ACS kubernetes (using Windows server 2016 as a node).

I've tried installing the proposed KB, but it errors with "the update is not applicable to your computer". Also, tried running the hns script, but Get-HNSNetwork isn't recognized.

Isn't there an official Microsoft solution for this issue?

JiangtianLi commented 6 years ago

@mh101010 In what region did you deploy ACS? And just to confirm: is your Windows node Windows Server 2016 (RS1), not 1709 (RS3)?

mh101010 commented 6 years ago

@JiangtianLi East US

madhanrm commented 6 years ago

@roycornelissen Can you copy the folder https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/debug and execute "powershell collectlogs.ps1"

I would like to look at the system state first before looking at the traces.

jbiel commented 6 years ago

@JiangtianLi - I am attempting to build a new Windows node image (I'd like to try KB4089848 to see if it fixes our issues) and have been using KB123456 and KB999999 (outside of acs-engine.) Now those patches fail to install because they're expired. Are those patches still needed? I see they're still specified in ./parts/k8s/kuberneteswindowssetup.ps1. Thanks.

Starting installation of hotfix KB123456

Checking for expiration of the hotfix
ERROR: The test signed hotfix you are trying to install has expired. Please contact Microsoft Support to get a newer version.

Closing the current window in 5 seconds
pushkar-bitwise commented 6 years ago

Hi @madhanrm & @JiangtianLi, we are facing an issue: after a restart, the Windows node is unreachable. It looks like some code goes into a loop. Network connectivity goes on and off, and after around 30 minutes the node is completely down. The code below, from https://github.com/Azure/acs-engine/blob/master/parts/k8s/kuberneteswindowssetup.ps1, seems to be causing the issue:

# startup the service
`$hnsNetwork = Get-HnsNetwork | ? Name -EQ `$global:NetworkMode.ToLower()
if (`$hnsNetwork)
{
    # Kubelet has been restarted with existing network.
    # Cleanup all containers
    docker ps -q | foreach {docker rm `$_ -f}
    # cleanup network
    Write-Host "Cleaning up old HNS network found"
    Remove-HnsNetwork `$hnsNetwork
    Start-Sleep 10
}

pushkar-bitwise commented 6 years ago

@madhanrm & @JiangtianLi, is it possible to provide a custom Windows image blob with the patch already installed? If yes, can you please explain how that can be achieved?

JiangtianLi commented 6 years ago

@pushkar-bitwise I have a PR here https://github.com/Azure/acs-engine/pull/2532 but haven't fully validated it yet.

jdinard commented 6 years ago

I've installed the patch, and followed the directions from winterTTr, but my pods still can't resolve DNS names. Any more ideas as to why pods can't make any outgoing requests?

A simple C# console app can't run. I am just trying to do a var client = new System.Net.WebClient(); client.OpenRead("http://www.google.com");

Results in host not found exceptions. The same console application runs fine in docker on the machine.

mingw2358 commented 6 years ago

I've also followed the directions from winterTTr, and I'm having a different kind of problem. It looks like the patch is not compatible with windowsservercore-1709 images. After installing the patch, my pod (with base image microsoft/dotnet-framework:4.7.1-windowsservercore-1709) failed to pull down the image with the following error:

Failed to pull image "xxxxx": rpc error: code = Unknown desc = failed to register layer: re-exec error: exit status 1: output: remove \\?\C:\ProgramData\docker\windowsfilter\0376aaabd774941c332de5bf1c40bbb48c8df0ecaff35441973e9c8174c75885\UtilityVM\Files\Windows\WinSxS\amd64_microsoft-hyper-v-winsock-provider_31bf3856ad364e35_10.0.16299.15_none_fa874cf48b54cc18\wshhyperv.dll: Access is denied.

It seems to be related to this issue: https://github.com/moby/moby/issues/36092

jdinard commented 6 years ago

@mingw2358 That looks like a different issue. I was able to apply the patch on a new cluster that was provisioned using the Datacenter-Core-1709-with-Containers-smalldisk image.

I'm occasionally able to do an nslookup www.google.com on pods that have their dns servers manually set to the dns pods, but even then it generally fails.

At the moment DNS resolution is highly unreliable on the pods and fails 99% of the time.

I used acs-engine 14.6 with the attached settings:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "devWinKube",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "windowspool2",
        "count": 2,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet",
        "osType": "Windows"
      }
    ],
    "windowsProfile": {
      "adminUsername": "user",
      "adminPassword": "xxxxxxxxxxxxxxx"
    },
    "linuxProfile": {
      "adminUsername": "user",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxxxxxx"
    }
  }
}

Even after setting the DNS server, doing a simple ipconfig /displaydns fails with Windows being unable to display the DNS Resolver Cache.

After redeploying the cluster using the latest version of acs-engine, the pods can't even pull images. They fail with a dns lookup issue when trying to pull an image.

esheris commented 6 years ago

Hello, I have been following this for a while, posted on it a while back also. Just wondering when the azure image that gets deployed will include kb4089848. I have been spinning clusters up and down regularly for the past few weeks and I have to install this kb each time. It is quite time consuming.

JiangtianLi commented 6 years ago

@esheris Sorry for the delay and inconvenience. I will add kb in acs-engine soon.

esheris commented 6 years ago

Fairly serious question here. What is the full status of this issue? I am hoping to begin a migration into azure in the next few weeks but I can't really migrate with this issue as I need functional internal dns. Even applying the patch and following some recommendations above doesn't consistently resolve the issue. Does this issue exist in RS1? Is there a way to deploy an RS1 cluster with the current version of acs-engine?

croemmich commented 6 years ago

@esheris On a more serious note, I wouldn't run a production workload in Docker on Windows for at least 6 months. We've been having major issues causing downtime a few times a week, and Microsoft's support is absolutely terrible, as highlighted by how long this issue has been open. I regret our decision to move and may move back to traditional deployments. Just this morning, 7 of our 9 nodes filled up their disks with Docker tmp files, which killed all 7 of those nodes almost simultaneously and caused 30 minutes of downtime. Seriously... avoid like the plague.

As a side note, we've been running k8s workloads in Linux for about 2 years and Docker for 4, so we're fairly competent with the platform.

4c74356b41 commented 6 years ago

@croemmich Why do you think 6 months will suffice? The whole Docker on Windows journey is a couple of years old already. Hardly production-grade stuff at this point, unfortunately.

croemmich commented 6 years ago

@4c74356b41 optimism, but you're probably correct sadly.

jbiel commented 6 years ago

The latest iteration of our cluster is still experiencing intermittent DNS issues. It doesn't just affect service IPs. The following configuration has been in place for ~4.5 days:

So, TMK we've got all of the suggested/speculated upon workarounds in place to no avail. Are there any other hopes/suggestions?

sam-cogan commented 6 years ago

Is there any further update (official or otherwise) on this? Even with the workaround posted DNS is still not working on 1709 nodes and it makes it pretty much unusable.

daschott commented 6 years ago

@sam-cogan @jbiel @croemmich Sorry for the issues you're having -- Kubernetes for Windows is still an ongoing/evolving story! Right now, the recommendation is to move to Windows Server, version 1803 since there were a lot of networking bug fixes, not all of which were backported to 1709. Otherwise, see this list.

SteveCurran commented 6 years ago

@daschott Thank you very much. Great list of all the issues and workarounds. Can I create a new cluster via acs-engine and expect the Windows nodes to be 1803?

jsturtevant commented 6 years ago

@SteveCurran you should be able to add WindowsSku to the windowsProfile section:

"windowsProfile": {
      "adminUsername": "azureuser",
      "adminPassword": "",
      "WindowsSku": "Datacenter-Core-1803-with-Containers-smalldisk",
      "WindowsPublisher": "MicrosoftWindowsServer",
      "WindowsOffer": "WindowsServerSemiAnnual"
    },

After generating the template, I found that this updates the parameters file (azuredeploy.parameters.json) with the correct value, but the generated ARM template uses a variable, agentWindowsSku, that does not get updated. As a workaround until this is fixed in acs-engine, you should be able to modify the azuredeploy.json ARM template to use 1803: "agentWindowsSku": "Datacenter-Core-1803-with-Containers-smalldisk"
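
The manual template edit can also be scripted. This is a minimal sketch, assuming that agentWindowsSku lives under the top-level "variables" object of azuredeploy.json (the usual place for ARM template variables); the helper name `Set-AgentWindowsSku` is hypothetical.

```powershell
# Hypothetical helper: returns the template JSON with variables.agentWindowsSku replaced.
function Set-AgentWindowsSku {
    param(
        [Parameter(Mandatory)] [string] $TemplateJson,
        [Parameter(Mandatory)] [string] $Sku
    )
    $template = $TemplateJson | ConvertFrom-Json
    # Fail loudly instead of silently adding a variable the template never reads.
    if (-not $template.variables -or
        -not ($template.variables.PSObject.Properties.Name -contains 'agentWindowsSku')) {
        throw "Template has no variables.agentWindowsSku entry."
    }
    $template.variables.agentWindowsSku = $Sku
    # ARM templates nest deeply; the default -Depth of 2 would truncate them.
    return $template | ConvertTo-Json -Depth 100
}
```

For example, `Set-AgentWindowsSku -TemplateJson (Get-Content azuredeploy.json -Raw) -Sku 'Datacenter-Core-1803-with-Containers-smalldisk'` returns the patched template text, which can then be written back to a file. Round-tripping through `ConvertTo-Json` may reformat the file and escape some characters, so it is worth diffing the result before deploying.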

4c74356b41 commented 6 years ago

Anybody else finding smalldisk os image for windows containers ridiculous?

SteveCurran commented 6 years ago

@jsturtevant I am getting an "unknown JSON tag WindowsSku" error when generating from the api model. What version of acs-engine supports this, or is this just in the canary?

sam-cogan commented 6 years ago

I was able to deploy an 1803 cluster with the following:

   "windowsProfile": {
      "adminUsername": "",
      "adminPassword": "",
      "imageVersion": "1803.0.20180504"
    },

Which then works out the SKU etc for you. I still then had to manually adjust the generated ARM template to change from 1709 to 1803.

PatrickLang commented 6 years ago

This issue has gone a bit off the rails here. Let me try to clarify a few things back up:

1) Windows Server version 1709 and Windows Server version 1803 both have the Windows fix required

2) A change to the Azure CNI plugin and acs-engine were needed to deploy a CNI fix. These changes are in master as of May 11, but not in an acs-engine release today.

So strictly speaking - you don't need Windows Server version 1803, and it hasn't been validated with acs-engine yet. We're still working on it (see #2965) but it's not ready quite yet.

If you build acs-engine master you can deploy 1709 and DNS works today.

sam-cogan commented 6 years ago

I've just deployed a 1709 cluster using master and the DNS issue persists, so it doesn't seem like it is fixed in that version.

jbiel commented 6 years ago

Our 1709 cluster with May updates still has issues. I'm working on updating our pipeline to build 1803 images and hope to have that in place soon.

sam-cogan commented 6 years ago

1803 won't work with ACS engine until the #2976 fix gets merged, unfortunately.

PatrickLang commented 6 years ago

@sam-cogan if you're still hitting problems with 1709, can you include more details: the acs-engine version and the ipconfig /all output from a running pod?

I'm not 100% sure we're looking at the same issue. When I was having DNS failures, ipconfig /all didn't list any DNS servers at all. Now I'm getting them on both 1709 and 1803

sam-cogan commented 6 years ago

I'm building the latest ACS engine from master, and I am getting DNS servers configured, so it may be a different issue. My problem is that the DNS servers cannot resolve any internal service DNS names, although they can resolve external ones.

sam-cogan commented 6 years ago

I've made a bit of progress, but I don't really understand why. Below is the ipconfig output from my pod. It has two DNS servers registered. If I do an nslookup, it defaults to using 168.63.129.16 and fails. If I force it to use 10.0.0.10 and specify the fully qualified name of the service, app.default.svc.cluster.local, then it resolves.

Which begs two questions

  1. What is 168.63.129.16, and why is it being set as the primary DNS server?
  2. Why will 10.0.0.10 only resolve with the fully qualified name, despite a DNS suffix being set for svc.cluster.local?
Windows IP Configuration

Ethernet adapter vEthernet (2fe598c1-eth0):

   Connection-specific DNS Suffix  . : svc.cluster.local
   Link-local IPv6 Address . . . . . : fe80::20ef:6be3:b17b:b6f6%27
   IPv4 Address. . . . . . . . . . . : 10.240.0.79
   Subnet Mask . . . . . . . . . . . : 255.240.0.0
   Default Gateway . . . . . . . . . : 10.240.0.1
PS C:\setup> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : tufted-aardwolf-app-deployment-7c976f6c9d-xhdjp
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : svc.cluster.local

Ethernet adapter vEthernet (2fe598c1-eth0):

   Connection-specific DNS Suffix  . : svc.cluster.local
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #4
   Physical Address. . . . . . . . . : 00-15-5D-FC-E3-32
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::20ef:6be3:b17b:b6f6%27(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.240.0.79(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.240.0.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 168.63.129.16
                                       10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled
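
To compare the DNS server order across pods, the "DNS Servers" entries can be pulled out of the ipconfig /all text. This is a sketch only: the helper name `Get-IpconfigDnsServers` is hypothetical, and it assumes the English output layout shown above. For context, 168.63.129.16 is the Azure platform's virtual IP address, which among other things serves Azure-provided DNS.

```powershell
# Hypothetical helper: extracts the "DNS Servers" addresses, in order, from ipconfig /all text.
function Get-IpconfigDnsServers {
    param(
        # Output lines of "ipconfig /all"; blank lines are allowed.
        [string[]] $Lines = @()
    )
    $servers = @()
    $inDnsBlock = $false
    foreach ($line in $Lines) {
        if ($line -match '^\s*DNS Servers[ .]*:\s*(\S+)') {
            # First server sits on the labelled line.
            $servers += $Matches[1]
            $inDnsBlock = $true
        } elseif ($inDnsBlock -and $line -match '^\s+(\d{1,3}(\.\d{1,3}){3})\s*$') {
            # Continuation lines carry additional IPv4 servers without a label.
            $servers += $Matches[1]
        } else {
            $inDnsBlock = $false
        }
    }
    return $servers
}
```

Inside a pod, `@(Get-IpconfigDnsServers -Lines (ipconfig /all))[0]` gives the server that nslookup uses by default, which makes it easy to spot pods where the cluster DNS address 10.0.0.10 is not listed first.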
sam-cogan commented 6 years ago

I've also deployed an 1803 cluster now and have the same problem, so it is not fixed there either. It actually seems worse, as I can't get it to resolve using 10.0.0.10 either; I have to specify the IP of the actual DNS pods.