Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine

The cluster-internal DNS server cannot be used from Windows containers #2027

Closed chweidling closed 5 years ago

chweidling commented 6 years ago

Is this a request for help?: NO


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.8.6

What happened:

I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests from Windows containers to the cluster's own DNS server (kube-dns) time out. Requests to external DNS servers work.

Remark: This issue is related to #558 and #1949. Those issues suggest that the DNS problems are connected to the Windows dnscache service or to the custom VNET feature, but the following description points in a different direction.

What you expected to happen: Requests to the internal DNS server should not time out.

Steps to reproduce:

Deploy a simple Kubernetes cluster with one master node and two Windows nodes using the following API model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "networkPolicy": "none"
      },
      "orchestratorRelease": "1.8"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "---",
      "vmSize": "Standard_D4s_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "backend",
        "count": 2,
        "osType": "Windows",
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "AvailabilitySet"
      }      
    ],
    "windowsProfile": {
      "adminUsername": "---",
      "adminPassword": "---"
    },
    "linuxProfile": {
      "adminUsername": "weidling",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa ---"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "---",
      "secret": "---"
    }
  }
}

Then run a Windows container. I used the following command: kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell

Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:

PS C:\> nslookup
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.0.0.10

> github.com
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds. 
(repeats 3 more times)
*** Request to UnKnown timed-out

> server 8.8.8.8
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [8.8.8.8]
Address:  8.8.8.8

> github.com
Server:  [8.8.8.8]
Address:  8.8.8.8

Non-authoritative answer:
Name:    github.com
Addresses:  192.30.253.113
          192.30.253.112

> exit

Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.

I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the API model. With the model above, I get the following network configuration inside the Windows pod:

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : mycore-96fdd75dc-8g5kd
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (9519cc22abb5ef39c786c5fbdce98c6a23be5ff1dced650ed9e338509db1eb35_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-87-0F-CC
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::a58c:aaf:c12b:d82c%21(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.92(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled
JiangtianLi commented 6 years ago

@4c74356b41 Looks like the patch is effective. Did DNS work in that container before and then stop working, or is it a new container in which DNS never worked? If you don't need that pod, can you try the following (using https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1)

Stop-Service kubeproxy
Stop-Service kubelet
Get-HnsNetworks | ? Name -eq l2Bridge | Remove-HnsNetwork 
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy

and then create a new POD? If the POD was working before but stopped working (without the node rebooting), then I'll need to loop in the networking team and collect a trace.

4c74356b41 commented 6 years ago

does deleting this pod and deployment and creating a new one in its place count as a new pod? I've had both situations: pods losing DNS and pods not having DNS from the get-go. But I'm not sure I've seen pods losing internet without the node rebooting.

OK, that did the trick. The pod has internet access now.

I've seen pods randomly not getting internet at startup (when all the other pods had it). That I did see.

So should I create a startup script that does this and add it to both nodes?

JiangtianLi commented 6 years ago

@4c74356b41 A pod created after deleting the old one is also what I meant by "new pod". If the POD is running and DNS stops working (without the node rebooting), then that is something new.

msorby commented 6 years ago

@4c74356b41 I'm seeing this today. Had a pod with DNS working, then all of a sudden it stopped working, without the node rebooting. Killing the pod so it creates a new one did not help. Then on one recreation of the pod it started to work again. No idea why.

@JiangtianLi I'm pretty sure I'm seeing the pattern where it just stops working, but I'm not 100% sure whether it was after deleting a pod and having it recreated. I'll pay attention to the pattern.

//Morten

JiangtianLi commented 6 years ago

@msorby Thanks for reporting. I'll report to networking team.

JiangtianLi commented 6 years ago

/cc @madhanrm

SteveCurran commented 6 years ago

Created a new cluster today using acs-engine 0.13.1 and Kubernetes 1.9.1, with all containers based on microsoft/windowsservercore:1709_KB4074588. No internal DNS resolution.

qmalexander commented 6 years ago

@JiangtianLi any status on this?

4c74356b41 commented 6 years ago

@qmalexander deploy from the PR @JiangtianLi mentioned and it appears to be working

JiangtianLi commented 6 years ago

@qmalexander If your issue is that internal DNS is not working, can you follow the steps below and let me know the results? (A consolidated version of these checks is sketched after the list.)

  1. kubectl get po --all-namespaces to find the POD IP of kube-dns
  2. inside windows container, Resolve-DnsName www.bing.com
  3. inside windows container, ipconfig /all
  4. inside windows container, Test-NetConnection 10.0.0.10 -Port 53
  5. inside windows container, Test-NetConnection <kube-dns pod ip> -Port 53
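
Consolidated, steps 2-5 are just the following commands run inside the Windows container, with the kube-dns pod IP from step 1 substituted for the placeholder:

Resolve-DnsName www.bing.com
ipconfig /all
Test-NetConnection 10.0.0.10 -Port 53
Test-NetConnection <kube-dns pod ip> -Port 53
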
roycornelissen commented 6 years ago

Really annoying problem... I have deployed a 1.9.3 cluster with acs-engine 0.13.1 this morning. On one Windows node, external DNS works, but on the other node, pods just cannot resolve any external addresses. Internal DNS doesn't work at all... Is the root cause for this known and being addressed?

Any tips for which version of K8s and which version of acs-engine actually work?

JiangtianLi commented 6 years ago

@roycornelissen There is a known issue with DNS after a Windows node reboots, and we are working on a fix: https://github.com/Azure/acs-engine/pull/2378. Can you use the steps in this thread to mitigate? Pasted here:

On the Windows node:
Import the module from https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1 (one way to do this is sketched after the commands), then run:
Stop-Service kubeproxy
Stop-Service kubelet
Get-HnsNetworks | ? Name -eq l2Bridge | Remove-HnsNetwork 
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy
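
One way to perform the import step on the node, assuming BITS is available (the same download and import are used in a script shared later in this thread):

Start-BitsTransfer -Source https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/windows/hns.psm1 -Destination C:\hns.psm1
Import-Module -Name C:\hns.psm1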
roycornelissen commented 6 years ago

@JiangtianLi Thanks! This seems to fix the external DNS issue at least. Internal DNS is still not working properly, but for now I'm happy if at least external DNS is working.

4c74356b41 commented 6 years ago

@roycornelissen if your internal DNS isn't working, something is most likely wrong with the cluster. Check that kube-dns is running, check your route tables, and check that Kubernetes has permissions in Azure.

qmalexander commented 6 years ago

@JiangtianLi I got it working with your fix. Thanx!

I created a simple ps1 script. https://github.com/qmalexander/awn/blob/master/acs_fix_external_communication.ps1

Start-BitsTransfer -Source https://raw.githubusercontent.com/qmalexander/awn/master/acs_fix_external_communication.ps1 -Destination C:\acs_fix_external_communication.ps1

Now go ahead and run:

.\acs_fix_external_communication.ps1

roycornelissen commented 6 years ago

@qmalexander Very useful script, thanks for sharing!

@4c74356b41 kube-dns is running, and the SP the cluster runs under is Owner of the resource group. Just deployed a cluster fresh out of the box. Internal DNS on Linux nodes works fine, but not on Windows. This is not hugely blocking for me though, just a little inconvenient. I'm really happy I got the external DNS working again. I'll check my route tables, but I'm unsure how they would have ended up wrong right after deploying the cluster and a few pods...

qmalexander commented 6 years ago

@roycornelissen I noticed that I had to append svc.cluster.local if I wanted to use the internal DNS name for my service call.

my-svc.my-namespace.svc.cluster.local

roycornelissen commented 6 years ago

The fix doesn't seem completely consistent... I still had some pods with random DNS issues. I have now rebooted both Windows nodes, applied the fix and redeployed my pods, and it works for now...

roycornelissen commented 6 years ago

@qmalexander That works for resolving services that run on Linux nodes, but not for my service that runs on a Windows node... I just noticed that it also won't work when I use the cluster IP: unable to connect to remote server.

qmalexander commented 6 years ago

@roycornelissen strange.. What base image are you using for the docker containers?

roycornelissen commented 6 years ago

@qmalexander I have a combination of different images:

microsoft/windowsservercore:1709
microsoft/aspnet:4.7.1-windowsservercore-1709
microsoft/dotnet-framework:4.7.1-windowsservercore-1709

Currently, the latter 2 are running in my cluster.

qmalexander commented 6 years ago

@roycornelissen try with microsoft/windowsservercore:1709_KB4074588

zhech2 commented 6 years ago

I am using acs-engine to deploy Kubernetes 1.9.1 as described in this thread, and I am seeing errors like the following with any service I deploy:

Error creating load balancer (will retry): error getting LB for service default/kubernetes: Service(default/kubernetes) - Loadbalancer not found

This example is for the kubernetes service, but all services show the same error.

Is anyone else running into this issue?

feiskyer commented 6 years ago

error getting LB for service default/kubernetes: Service(default/kubernetes) - Loadbalancer not found

This is a misleading message, which can safely be ignored.

4c74356b41 commented 6 years ago

"This example is of the kubernetes service but all are the same" - if they are all failing with the same message, check the service principal credentials/permissions. But this particular message doesn't matter, indeed.

jimfim commented 6 years ago

Hey Folks, I'm experiencing this issue on a newly created 1.9.3 hybrid cluster.

On the Windows node I ran the script posted by @qmalexander. This is the output:

PS C:\> cat .\acs_fix_external_communication.ps1
Start-BitsTransfer -Source https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/windows/hns.psm1 -Destination C:\hns.psm1
Import-Module -name C:\hns.psm1 -Verbose
Stop-Service kubeproxy
Stop-Service kubelet
Get-HNSNetwork | ? Name -eq l2Bridge | Remove-HnsNetwork
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy

PS C:\> .\acs_fix_external_communication.ps1
VERBOSE: Loading module from path 'C:\hns.psm1'.
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less
discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose
parameter. For a list of approved verbs, type Get-Verb.
VERBOSE: The 'Attach-HNSEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HNSEndpoint'.
VERBOSE: The 'Attach-HnsHostEndpoint' command in the hns' module was imported, but because its name does not include an
 approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HnsHostEndpoint'.
VERBOSE: The 'Attach-HNSVMEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HNSVMEndpoint'.
VERBOSE: The 'Detach-HNSEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSEndpoint'.
VERBOSE: The 'Detach-HNSHostEndpoint' command in the hns' module was imported, but because its name does not include an
 approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSHostEndpoint'.
VERBOSE: The 'Detach-HNSVMEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSVMEndpoint'.
VERBOSE: Importing function 'Get-HNSActivities'.
VERBOSE: Importing function 'Get-HNSPolicyList'.
VERBOSE: Importing function 'Get-HnsSwitchExtensions'.
VERBOSE: Importing function 'Invoke-HNSRequest'.
VERBOSE: Importing function 'New-HnsEndpoint'.
VERBOSE: Importing function 'New-HnsLoadBalancer'.
VERBOSE: Importing function 'New-HnsNetwork'.
VERBOSE: Importing function 'New-HnsRemoteEndpoint'.
VERBOSE: Importing function 'New-HnsRoute'.
VERBOSE: Importing function 'Remove-HnsPolicyList'.
VERBOSE: Importing function 'Set-HnsSwitchExtension'.

PS C:\> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut

ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 53
InterfaceAlias         : vEthernet (Ethernet 2)
SourceAddress          : 10.240.0.4
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

This is the IP address of a kube-dns pod in the cluster:
PS C:\> Test-NetConnection 10.244.0.4 -Port 53
ComputerName     : 10.244.0.4
RemoteAddress    : 10.244.0.4
RemotePort       : 53
InterfaceAlias   : vEthernet (Ethernet 2)
SourceAddress    : 10.240.0.4
TcpTestSucceeded : True

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : 38092k8s9010
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : j0eptdncmm0unmdk0itzjxs3gb.bx.internal.cloudapp.net

Ethernet adapter vEthernet (Ethernet 2):

   Connection-specific DNS Suffix  . : j0eptdncmm0unmdk0itzjxs3gb.bx.internal.cloudapp.net
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #2
   Physical Address. . . . . . . . . : 00-0D-3A-19-3D-CE
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d58a:6bb2:a35:6900%13(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.240.0.4(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Lease Obtained. . . . . . . . . . : Thursday, March 15, 2018 10:14:25 AM
   Lease Expires . . . . . . . . . . : Sunday, April 21, 2154 4:46:25 PM
   Default Gateway . . . . . . . . . : 10.240.0.1
   DHCP Server . . . . . . . . . . . : 168.63.129.16
   DHCPv6 IAID . . . . . . . . . . . : 218107194
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-22-3A-AF-3E-00-15-5D-78-11-88
   DNS Servers . . . . . . . . . . . : 168.63.129.16
   NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter
   Physical Address. . . . . . . . . : 00-15-5D-01-99-45
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::fc64:6aeb:16a1:8763%8(Preferred)
   IPv4 Address. . . . . . . . . . . : 172.26.176.1(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :
   DHCPv6 IAID . . . . . . . . . . . : 167777629
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-22-3A-AF-3E-00-15-5D-78-11-88
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Enabled

After running the above, I created a new microsoft/windowsservercore:1709_KB4088776 pod on the cluster. To test the connection I spun up an Elasticsearch pod behind a service "search" exposing port 9200; the service IP is 10.0.96.72 and the Elasticsearch pod IP is 10.244.0.11.

These are the results of my tests:


PS C:\> nslookup www.bing.com
DNS request timed out.
    timeout was 2 seconds.
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out

PS C:\> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut

ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 53
InterfaceAlias         : vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge)
SourceAddress          : 10.244.2.211
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

This is the IP address of a kube-dns pod in the cluster:
PS C:\> Test-NetConnection 10.244.0.4 -Port 53

ComputerName     : 10.244.0.4
RemoteAddress    : 10.244.0.4
RemotePort       : 53
InterfaceAlias   : vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge)
SourceAddress    : 10.244.2.211
TcpTestSucceeded : True

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : dep-util-5f989584db-g6kzp
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-51-38-1A
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::e0c8:a32e:6c75:16ac%21(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.211(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

PS C:\> (curl http://search:9200 -usebasicparsing).Content
curl : The remote name could not be resolved: 'search'
At line:1 char:2
+ (curl http://search:9200 -usebasicparsing).Content
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

PS C:\> (curl http://10.244.0.11:9200 -usebasicparsing).Content
{
  "name" : "2okWBvo",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "VoGkVbYgSAG200E7wbwMGA",
  "version" : {
    "number" : "5.6.5",
    "build_hash" : "6a37571",
    "build_date" : "2017-12-04T07:50:10.466Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

PS C:\> (curl http://10.0.96.72:9200 -usebasicparsing).Content
curl : Unable to connect to the remote server
At line:1 char:2
+ (curl http://10.0.96.72:9200 -usebasicparsing).Content
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

In the above instance it only works if I specify the IP address of the Elasticsearch pod directly. After this I spun up a pod on the Linux node to test connectivity from there:

bash-4.4# wget http://search:9200
Connecting to search:9200 (10.0.96.72:9200)
index.html           100% |********************************************************************************************************************************************************************************************|   327   0:00:00 ETA
bash-4.4# cat index.html
{
  "name" : "2okWBvo",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "VoGkVbYgSAG200E7wbwMGA",
  "version" : {
    "number" : "5.6.5",
    "build_hash" : "6a37571",
    "build_date" : "2017-12-04T07:50:10.466Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

bash-4.4# ifconfig
eth0      Link encap:Ethernet  HWaddr 0A:58:0A:F4:00:0B
          inet addr:10.244.0.11  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::60d8:44ff:fe06:fa4c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:95 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20106 (19.6 KiB)  TX bytes:9762 (9.5 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:48 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2880 (2.8 KiB)  TX bytes:2880 (2.8 KiB)

This seems to work perfectly.

The acs-engine version I have installed:

PS C:\> acs-engine.exe version
Version: v0.14.0
GitCommit: 6ba4622b
GitTreeState: clean

I am very interested in getting a hybrid cluster up and running. I hope the logs above help in resolving the issue.

roycornelissen commented 6 years ago

@4c74356b41 what should the proper permissions for the SP be? In my case it's Owner in the RG where the cluster exists, but I am indeed also seeing those "cannot create LoadBalancer" warnings.

4c74356b41 commented 6 years ago

Contributor is fine.


roycornelissen commented 6 years ago

@4c74356b41 OK thanks.

@JiangtianLi Unfortunately the fix is not persistent... A pod that was running fine yesterday has now lost DNS again (with no changes, restarts or redeployments) :(

zhech2 commented 6 years ago

I am having the same issue with DNS. It seems completely random whether a deployment will work or not.

JiangtianLi commented 6 years ago

/cc @madhanrm

Update on the random DNS issue: we have root-caused it. The outbound connection breaks ONLY when a single-container POD is created and then deleted on a node (i.e. no other pod exists on that node). The outbound connection is automatically restored once a new policy is pushed to kube-proxy on that node, which can happen under either of the following conditions:

  • A new service is deployed, which has a service VIP backed by a pod IP
  • Kube-proxy is restarted

However, the issue WILL NOT be seen if there is more than one POD deployed on the node.

As a mitigation, please deploy an extra POD as a DaemonSet on each node (a sketch follows). We are working on a fix.
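
A sketch of such a placeholder DaemonSet, applied with kubectl from a machine with cluster access. The name, label and keep-alive command are illustrative only; the nodeSelector targets the Windows nodes, and clusters older than 1.9 may need apps/v1beta2 or extensions/v1beta1 instead of apps/v1:

$placeholder = @'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: win-placeholder
spec:
  selector:
    matchLabels:
      app: win-placeholder
  template:
    metadata:
      labels:
        app: win-placeholder
    spec:
      nodeSelector:
        beta.kubernetes.io/os: windows
      containers:
      - name: placeholder
        image: microsoft/windowsservercore:1709
        command: ["powershell", "-Command", "while ($true) { Start-Sleep -Seconds 3600 }"]
'@
$placeholder | kubectl apply -f -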

roycornelissen commented 6 years ago

@JiangtianLi After a clean start of the nodes, applying the hns script workaround and deploying a "placeholder" pod as a DaemonSet, it looks like I do indeed have DNS restored on both Windows nodes with each new pod.

Small addition: it seems that this DaemonSet pod must be the first one deployed on the node; only after that do other pods have external DNS. My dummy pods still don't have DNS, but new ones do. Fingers crossed.

Anyway, thanks for putting so much effort in getting to the bottom of this.

@JiangtianLi EDIT March 19th: Still seeing new pods with no external DNS, sadly... Still not sure what triggers it. My environment is pretty dynamic with pods being started and torn down quite often, but then again... that's what Kubernetes is for :)

y325A commented 6 years ago

Sorry to be another person in this thread with problems! I can use the cluster DNS server through its Pod IP, but not through its service IP.

I have deployed a new cluster this morning using acs-engine 0.14.1 and Kubernetes 1.9.4.

I have tried the reset HNS script above and have spun up multiple Pods as @JiangtianLi mentioned above.

Here are my results from nslookup inside the pods:

PS C:\> nslookup
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.0.0.10

> github.com
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out
> server 10.244.0.10
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [10.244.0.10]
Address:  10.244.0.10

> github.com
Server:  [10.244.0.10]
Address:  10.244.0.10

Non-authoritative answer:
Name:    github.com
Addresses:  192.30.253.113
          192.30.253.112 

I get the same results from my Windows host.

I can reach my DNS service from my Linux master:

> server 10.0.0.10
Default server: 10.0.0.10
Address: 10.0.0.10#53
> google.co.uk
Server:         10.0.0.10
Address:        10.0.0.10#53

Name:   google.co.uk
Address: 172.217.17.35
zhech2 commented 6 years ago

@JiangtianLi DNS is still not working for me either. Here is how I have set up my cluster:

  1. Installed kubernetes 1.9.1
  2. Switched all base containers to microsoft/windowsservercore:1709_KB4074588
  3. Restarted all nodes
  4. Performed hns fix to all nodes
  5. Deployed dummy pods using DaemonSet after restarting kubelet and kubeproxy

Result: whether internal and external DNS work at all is very random.

I run the following script as part of my Dockerfile to assess whether things are working:

# Append the cluster DNS suffixes so that short service names resolve
Set-DnsClientGlobalSetting -SuffixSearchList @("${env:NAMESPACE}.svc.cluster.local", "svc.cluster.local")
Get-DnsClientGlobalSetting

# Name resolution (external and internal) and direct reachability of the cluster DNS service
nslookup www.bing.com
nslookup database
Test-NetConnection 10.0.0.10 -Port 53
ipconfig /all

# Raw TCP connectivity to the database, by service name and by IP address
try { $c = New-Object System.Net.Sockets.TcpClient("database", 1433); $c.close(); write-host 'Connected to database' } catch { write-host 'Failed to connect to database' }
try { $c = New-Object System.Net.Sockets.TcpClient("10.240.0.6", 1433); $c.close(); write-host 'Connected to 10.240.0.6' } catch { write-host 'Failed to connect to 10.240.0.6' }

The output of the script is essentially the same as others have reported. Requests time out or the application or script fails to connect to the database.

Thank you for all your help on this issue.

jbiel commented 6 years ago

Can someone from the Microsoft/Azure team please comment on the recommendation of using microsoft/windowsservercore:1709_KB4074588 as the base image for containers? This keeps popping up and my understanding is that the current microsoft/windowsservercore:1709 tag has that KB patch so it shouldn't be necessary to use the KB-specific tag.

PatrickLang commented 6 years ago

:1709 is sort of like latest: each time a new cumulative update is released, :1709 is re-tagged to the newest update.

1709_kb??????? is a durable tag if you want to manually choose when to update your container.

jbiel commented 6 years ago

@PatrickLang - yes, understood. My question is whether or not the current 1709 tag has KB4074588 (released February 13, 2018) included. I'm assuming it does since the latest build was 6 days ago and it was built ~1 month after KB4074588 was released. We're struggling with intermittent DNS/service IP issues on containers running the current 1709 tag so I'm trying to make sure our environment is setup properly. Thanks.

PatrickLang commented 6 years ago

You can also see the Windows revision numbers in docker history:

docker history microsoft/windowsservercore:1709

IMAGE               CREATED             CREATED BY                      SIZE                COMMENT
6e8857bf419a        2 weeks ago         Install update 10.0.16299.309   1.59GB
<missing>           5 months ago        Apply image 10.0.16299.15       4.62GB

Those are continually increasing, and all updates are cumulative. This page has a decoder ring to go from revision to KB: https://support.microsoft.com/en-us/help/4043454

PatrickLang commented 6 years ago

Also, docker inspect microsoft/windowsservercore:1709 has an OsVersion property.
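
For example, something like the following should print just that property (using the image tag discussed above):

docker inspect --format "{{.OsVersion}}" microsoft/windowsservercore:1709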

qmalexander commented 6 years ago

@JiangtianLi I have done all that you suggest. After a redeploy + script + DaemonSet, the pods have internal and external communication. But after a while it suddenly stops, and now there is no communication... If I run the hns script and redeploy my pods, all works fine again...

@JiangtianLi Are you aware of this? Will the fix you're working on solve this too? ETA?

JiangtianLi commented 6 years ago

@qmalexander I've forwarded your issue to the team.

JiangtianLi commented 6 years ago

@qmalexander if possible, can you follow:

https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/startpacketcapture.cmd

<Try Test-Connection, until it fails>

https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/stoppacketcapture.cmd

And share c:\server.etl somewhere?

roycornelissen commented 6 years ago

Is there an ETA for a resolution of these DNS issues? It's getting quite painful to do our current project. DNS or no DNS is like a lottery :(

SteveCurran commented 6 years ago

@roycornelissen I moved ahead with my project by continuing to use the workaround of setting the DNS addresses in the startup of my Windows containers. Version 1.9 fixed the periodic external DNS issues caused by a race condition, but you still have to use the workaround. I just pass both internal addresses to the pod as arguments (a startup-script sketch follows the commands below). DNS is working consistently on the pods.

$adapter = Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses $dns1,$dns2
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
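
A rough sketch of wrapping this into a container startup script that receives the two DNS addresses as arguments, as described above (the parameter names and the final hand-off line are illustrative, not from the original):

param(
    [Parameter(Mandatory=$true)][string]$dns1,
    [Parameter(Mandatory=$true)][string]$dns2
)

# Point the container's adapter at the cluster DNS servers passed in as pod arguments
$adapter = Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses $dns1,$dns2
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"

# Then start the container's real entry point, e.g.:
# & C:\app\start.ps1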

zhech2 commented 6 years ago

@JiangtianLi When I run the pod, it randomly comes up either with working DNS, with no network connectivity at all, or with no DNS but still able to connect to other servers.

I am willing to get a capture for you. Would I run the startpacketcapture.cmd on one of the nodes and start a pod until it has a problem and then send server.etl to you?

Also, we are willing to give you access to our test cluster if that will help. Thanks.

zhech2 commented 6 years ago

@SteveCurran Thanks for your workaround. It seems to work 90% of the time whereas before it was working more like 10% of the time. Out of curiosity how many pods are you running? We are running around 30.

JiangtianLi commented 6 years ago

@zhech2 Yes, we want to capture the trace while the issue is happening. The size of the trace may be huge if you run it for a long time, though. Thank you for reporting the issue to help us investigate. I'll also loop back with the team.

SteveCurran commented 6 years ago

@zhech2 I am just running two nodes.

madhanrm commented 6 years ago

@zhech2 First I would like to know the state of the system. Can you copy the folder https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/debug, execute "powershell collectlogs.ps1" on one of the problematic hosts (one way to fetch and run it is sketched below), and pass me a link to those files? You can also get a packet capture on that host by running a simple Test-Connection or nslookup inside the container while the POD is in a bad state.
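
One way to fetch and run just the log collector mentioned above, assuming the collectlogs.ps1 script sits directly in that debug folder (it may also expect the other files from the folder alongside it):

New-Item -ItemType Directory -Path C:\k\debug -Force | Out-Null
Start-BitsTransfer -Source https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 -Destination C:\k\debug\collectlogs.ps1
powershell C:\k\debug\collectlogs.ps1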

j03wang commented 6 years ago

I recently rebuilt a cluster and encountered this exact issue of not being able to connect to service IPs (and therefore no DNS) -- my current setup actually does work, although I haven't tested it with more than a few pods.

Here's what I did: