chweidling closed this issue 5 years ago
@4c74356b41 Looks like the patch is effective. Did DNS work in that container before and then stop working, or is it a new container that never had working DNS? If you don't need that pod, can you try (with https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1)
Stop-Service kubeproxy
Stop-Service kubelet
Get-HNSNetwork | ? Name -eq l2Bridge | Remove-HnsNetwork
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy
and create a new POD? If the POD was working before but stopped working (without the node rebooting), then I'll need to loop in the networking team and collect a trace.
Does deleting this pod and having the deployment create a new one in its place count as a new pod? I've had both situations: pods losing DNS, and pods not having DNS from the get-go. But I'm not sure I've seen pods losing internet without the node rebooting.
OK, that did the trick. The pod has internet now.
I've seen pods not getting internet randomly at startup (when all the other pods had it). That I did see.
So should I create a startup script that does this and add it to both nodes?
@4c74356b41 Pod created after deleting is also what I meant by "new pod". If the POD is running and DNS stops working (without node rebooting), then that is something new.
@4c74356b41 I'm seeing this today. Had a pod with dns working, then all of a sudden it stopped working. Without the node rebooting. Killing the pod so it creates a new did not help. Then on one recreation of the pod it's started to work again. No idea why.
@JiangtianLi I'm pretty sure I'm seeing the pattern where it just stops working, but I'm not 100% sure if it was after deleting a pod and it got recreated. I'll pay attention to the pattern.
//Morten
@msorby Thanks for reporting. I'll report to networking team.
/cc @madhanrm
Created a new cluster today using acs-engine 0.13.1 and Kubernetes 1.9.1, all containers microsoft/windowsservercore:1709_KB4074588. No internal DNS resolution.
@JiangtianLi any status on this?
@qmalexander deploy from the PR @JiangtianLi mentioned and it appears to be working
@qmalexander If your issue is that internal DNS not working, can you follow the steps below and let me know the results?
kubectl get po --all-namespaces
to find the POD IP of kube-dns
Resolve-DnsName www.bing.com
ipconfig /all
Test-NetConnection 10.0.0.10 -Port 53
Test-NetConnection <kube-dns pod ip> -Port 53
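For environments where PowerShell's Test-NetConnection isn't handy, the port probe in the steps above can be approximated with a small cross-platform helper. This is a hypothetical sketch, not part of the thread; note it only mirrors the TCP probe, while kube-dns mostly serves UDP, so a real DNS query is a stronger test:

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Rough analogue of Test-NetConnection -Port: True if a TCP
    connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Mirrors the two checks above (addresses are the thread's examples):
# tcp_port_open("10.0.0.10", 53)          # kube-dns service IP
# tcp_port_open("<kube-dns pod ip>", 53)  # a kube-dns pod IP
```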
Really annoying problem... I have deployed a 1.9.3 cluster with acs-engine 0.13.1 this morning. On one Windows node, external DNS works, but on the other node, pods just cannot resolve any external addresses. Internal DNS doesn't work at all... Is the root cause for this known and being addressed?
Any tips for which version of K8s and which version of acs-engine actually works?
@roycornelissen There is a known issue with DNS after windows node reboots and we are working on a fix https://github.com/Azure/acs-engine/pull/2378. Can you use the steps in this thread to mitigate? Pasted here:
On windows node:
Import the module from https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1
Stop-Service kubeproxy
Stop-Service kubelet
Get-HNSNetwork | ? Name -eq l2Bridge | Remove-HnsNetwork
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy
@JiangtianLi Thanks! This seems to fix the external DNS issue at least. Internal DNS is still not working properly, but for now I'm happy if at least external DNS is working.
@roycornelissen if your internal DNS isn't working, something is most likely wrong with the cluster. Check that kube-dns is running, check your route tables, and check that k8s has permissions to Azure.
@JiangtianLi I got it working with your fix. Thanx!
I created a simple ps1 script. https://github.com/qmalexander/awn/blob/master/acs_fix_external_communication.ps1
Start-BitsTransfer -Source https://raw.githubusercontent.com/qmalexander/awn/master/acs_fix_external_communication.ps1 -Destination C:\acs_fix_external_communication.ps1
Now go ahead and run:
.\acs_fix_external_communication.ps1
@qmalexander Very useful script, thanks for sharing!
@4c74356b41 kube-dns is running, and the SP the cluster runs under is Owner of the resource group. Just deployed a cluster fresh out of the box. Internal DNS on Linux nodes works fine, but not on Windows. This is not hugely blocking for me though, just a little inconvenient. I'm really happy I got the external DNS working again. I'll check my route tables, but I'm unsure how they would have ended up wrong right after deploying the cluster and a few pods...
@roycornelissen I noticed that I had to append svc.cluster.local if I were to use the internal DNS name for my service call:
my-svc.my-namespace.svc.cluster.local
Fix doesn't seem completely consistent... I still had some pods with random DNS issues. Now rebooted both Windows nodes, applied the fix and redeployed my pods and it works for now...
@qmalexander works for resolving services that run on Linux nodes, but not for my service that runs on a Windows node... Just noticed that using the cluster IP, it also won't work. Unable to connect to remote server.
@roycornelissen strange.. What base image are you using for the docker containers?
@qmalexander I have a combination of different images:
microsoft/windowsservercore:1709 microsoft/aspnet:4.7.1-windowsservercore-1709 microsoft/dotnet-framework:4.7.1-windowsservercore-1709
Currently, the latter 2 are running in my cluster.
@roycornelissen try with microsoft/windowsservercore:1709_KB4074588
I am using acs-engine to deploy 1.9.1 of kubernetes as described in this thread and am seeing errors with any service I deploy as follows:
Error creating load balancer (will retry): error getting LB for service default/kubernetes: Service(default/kubernetes) - Loadbalancer not found
This example is of the kubernetes service but all are the same.
Is anyone else running into this issue?
error getting LB for service default/kubernetes: Service(default/kubernetes) - Loadbalancer not found
This is a misleading message, which can be safely ignored.
This example is of the kubernetes service but all are the same.
- if they all are failing with the same message, check SP credentials/permissions. But this one doesn't matter, indeed.
Hey Folks, I'm experiencing this issue on a newly created 1.9.3 hybrid cluster.
On the Windows node I ran the script posted by @qmalexander. This is the output:
PS C:\> cat .\acs_fix_external_communication.ps1
Start-BitsTransfer -Source https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/windows/hns.psm1 -Destination C:\hns.psm1
Import-Module -name C:\hns.psm1 -Verbose
Stop-Service kubeproxy
Stop-Service kubelet
Get-HNSNetwork | ? Name -eq l2Bridge | Remove-HnsNetwork
Get-HnsPolicyList | Remove-HnsPolicyList
Start-Service kubelet
Start-Service kubeproxy
PS C:\> .\acs_fix_external_communication.ps1
VERBOSE: Loading module from path 'C:\hns.psm1'.
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less
discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose
parameter. For a list of approved verbs, type Get-Verb.
VERBOSE: The 'Attach-HNSEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HNSEndpoint'.
VERBOSE: The 'Attach-HnsHostEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HnsHostEndpoint'.
VERBOSE: The 'Attach-HNSVMEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Add,Debug".
VERBOSE: Importing function 'Attach-HNSVMEndpoint'.
VERBOSE: The 'Detach-HNSEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSEndpoint'.
VERBOSE: The 'Detach-HNSHostEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSHostEndpoint'.
VERBOSE: The 'Detach-HNSVMEndpoint' command in the hns' module was imported, but because its name does not include an
approved verb, it might be difficult to find. The suggested alternative verbs are "Dismount,Remove".
VERBOSE: Importing function 'Detach-HNSVMEndpoint'.
VERBOSE: Importing function 'Get-HNSActivities'.
VERBOSE: Importing function 'Get-HNSPolicyList'.
VERBOSE: Importing function 'Get-HnsSwitchExtensions'.
VERBOSE: Importing function 'Invoke-HNSRequest'.
VERBOSE: Importing function 'New-HnsEndpoint'.
VERBOSE: Importing function 'New-HnsLoadBalancer'.
VERBOSE: Importing function 'New-HnsNetwork'.
VERBOSE: Importing function 'New-HnsRemoteEndpoint'.
VERBOSE: Importing function 'New-HnsRoute'.
VERBOSE: Importing function 'Remove-HnsPolicyList'.
VERBOSE: Importing function 'Set-HnsSwitchExtension'.
PS C:\> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut
ComputerName : 10.0.0.10
RemoteAddress : 10.0.0.10
RemotePort : 53
InterfaceAlias : vEthernet (Ethernet 2)
SourceAddress : 10.240.0.4
PingSucceeded : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded : False
This is the IP address of a kube-dns pod in the cluster:
PS C:\> Test-NetConnection 10.244.0.4 -Port 53
ComputerName : 10.244.0.4
RemoteAddress : 10.244.0.4
RemotePort : 53
InterfaceAlias : vEthernet (Ethernet 2)
SourceAddress : 10.240.0.4
TcpTestSucceeded : True
PS C:\> ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : 38092k8s9010
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : j0eptdncmm0unmdk0itzjxs3gb.bx.internal.cloudapp.net
Ethernet adapter vEthernet (Ethernet 2):
Connection-specific DNS Suffix . : j0eptdncmm0unmdk0itzjxs3gb.bx.internal.cloudapp.net
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #2
Physical Address. . . . . . . . . : 00-0D-3A-19-3D-CE
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::d58a:6bb2:a35:6900%13(Preferred)
IPv4 Address. . . . . . . . . . . : 10.240.0.4(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Lease Obtained. . . . . . . . . . : Thursday, March 15, 2018 10:14:25 AM
Lease Expires . . . . . . . . . . : Sunday, April 21, 2154 4:46:25 PM
Default Gateway . . . . . . . . . : 10.240.0.1
DHCP Server . . . . . . . . . . . : 168.63.129.16
DHCPv6 IAID . . . . . . . . . . . : 218107194
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-22-3A-AF-3E-00-15-5D-78-11-88
DNS Servers . . . . . . . . . . . : 168.63.129.16
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter vEthernet (nat):
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter
Physical Address. . . . . . . . . : 00-15-5D-01-99-45
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::fc64:6aeb:16a1:8763%8(Preferred)
IPv4 Address. . . . . . . . . . . : 172.26.176.1(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.240.0
Default Gateway . . . . . . . . . :
DHCPv6 IAID . . . . . . . . . . . : 167777629
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-22-3A-AF-3E-00-15-5D-78-11-88
DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
fec0:0:0:ffff::2%1
fec0:0:0:ffff::3%1
NetBIOS over Tcpip. . . . . . . . : Enabled
After running the above, I created a new microsoft/windowsservercore:1709_KB4088776 pod on the cluster. To test the connection I spun up an Elasticsearch pod behind a service "search" exposing port 9200. The service IP is 10.0.96.72; the Elasticsearch pod IP is 10.244.0.11.
These are results of my tests
PS C:\> nslookup www.bing.com
DNS request timed out.
timeout was 2 seconds.
Server: UnKnown
Address: 10.0.0.10
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to UnKnown timed-out
PS C:\> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut
ComputerName : 10.0.0.10
RemoteAddress : 10.0.0.10
RemotePort : 53
InterfaceAlias : vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge)
SourceAddress : 10.244.2.211
PingSucceeded : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded : False
This is the IP address of a kube-dns pod in the cluster:
PS C:\> Test-NetConnection 10.244.0.4 -Port 53
ComputerName : 10.244.0.4
RemoteAddress : 10.244.0.4
RemotePort : 53
InterfaceAlias : vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge)
SourceAddress : 10.244.2.211
TcpTestSucceeded : True
PS C:\> ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : dep-util-5f989584db-g6kzp
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
Ethernet adapter vEthernet (0a966976b45b1a7bca116dd28f1add0adb687f50a4485e25de3460da20f7bc43_l2bridge):
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
Physical Address. . . . . . . . . : 00-15-5D-51-38-1A
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::e0c8:a32e:6c75:16ac%21(Preferred)
IPv4 Address. . . . . . . . . . . : 10.244.2.211(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 10.240.0.1
DNS Servers . . . . . . . . . . . : 10.0.0.10
NetBIOS over Tcpip. . . . . . . . : Disabled
PS C:\> (curl http://search:9200 -usebasicparsing).Content
curl : The remote name could not be resolved: 'search'
At line:1 char:2
+ (curl http://search:9200 -usebasicparsing).Content
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
PS C:\> (curl http://10.244.0.11:9200 -usebasicparsing).Content
{
"name" : "2okWBvo",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "VoGkVbYgSAG200E7wbwMGA",
"version" : {
"number" : "5.6.5",
"build_hash" : "6a37571",
"build_date" : "2017-12-04T07:50:10.466Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
},
"tagline" : "You Know, for Search"
}
PS C:\> (curl http://10.0.96.72:9200 -usebasicparsing).Content
curl : Unable to connect to the remote server
At line:1 char:2
+ (curl http://10.0.96.72:9200 -usebasicparsing).Content
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
It only works in the above instance if I specify the IP address of the Elasticsearch pod. After this I spun up a pod on the Linux node to test connectivity from there.
bash-4.4# wget http://search:9200
Connecting to search:9200 (10.0.96.72:9200)
index.html 100% |********************************************************************************************************************************************************************************************| 327 0:00:00 ETA
bash-4.4# cat index.html
{
"name" : "2okWBvo",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "VoGkVbYgSAG200E7wbwMGA",
"version" : {
"number" : "5.6.5",
"build_hash" : "6a37571",
"build_date" : "2017-12-04T07:50:10.466Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
},
"tagline" : "You Know, for Search"
}
bash-4.4# ifconfig
eth0 Link encap:Ethernet HWaddr 0A:58:0A:F4:00:0B
inet addr:10.244.0.11 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::60d8:44ff:fe06:fa4c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:277 errors:0 dropped:0 overruns:0 frame:0
TX packets:95 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:20106 (19.6 KiB) TX bytes:9762 (9.5 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:48 errors:0 dropped:0 overruns:0 frame:0
TX packets:48 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2880 (2.8 KiB) TX bytes:2880 (2.8 KiB)
This seems to work perfectly.
The acs-engine version I have installed:
PS C:\> acs-engine.exe version
Version: v0.14.0
GitCommit: 6ba4622b
GitTreeState: clean
I am very interested in getting a hybrid cluster up and running. I hope my above logs help in resolving the issue
@4c74356b41 what should the proper permissions for the SP be? In my case it's Owner in the RG where the cluster exists, but I am indeed also seeing those "cannot create LoadBalancer" warnings.
Contributor is fine.
@4c74356b41 OK thanks.
@JiangtianLi Unfortunately the fix is not persistent... A pod that was running fine yesterday has now lost DNS again (with no changes, restarts or redeployments) :(
I am having the same issue with DNS. It seems completely random if a deployment will work or not.
/cc @madhanrm
Update for the random DNS issue: we have root-caused it. The outbound connection breaks ONLY when a single-container POD is created and then deleted on a node (i.e. no other pod exists on that node). The outbound connection is automatically restored once a new policy is pushed to kubeproxy on that node, which happens under any one of the following conditions:
• A new service is deployed, which has a service VIP backed by pod IPs
• Kubeproxy is restarted
However, the issue WILL NOT be seen if there is more than one POD deployed on the node.
As a mitigation, please deploy an extra POD as a DaemonSet on each node. We are working on a fix.
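The suggested mitigation (keeping at least one extra pod alive on every Windows node) can be expressed as a DaemonSet. The following is only a sketch under assumptions: the names and image are placeholders, and the nodeSelector uses the beta OS label that Kubernetes 1.9-era clusters applied to Windows nodes.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: win-placeholder        # hypothetical name
spec:
  selector:
    matchLabels:
      app: win-placeholder
  template:
    metadata:
      labels:
        app: win-placeholder
    spec:
      nodeSelector:
        beta.kubernetes.io/os: windows   # one pod per Windows node
      containers:
      - name: placeholder
        image: microsoft/windowsservercore:1709   # any long-running Windows image
        command: ["powershell", "-Command", "Start-Sleep -Seconds 2147483"]
```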
@JiangtianLi After a clean start of the nodes, applying the hns script workaround and deploying a "placeholder" pod as a DaemonSet, it looks like I indeed have DNS restored on both Windows nodes with each new pod.
Small addition: it seems that this DaemonSet pod must be the first one deployed on the node; only after that do other pods have external DNS. My dummy pods still don't have DNS, but new ones do. Fingers crossed.
Anyway, thanks for putting so much effort in getting to the bottom of this.
@JiangtianLi EDIT March 19th: Still seeing new pods with no external DNS, sadly... Still not sure what triggers it. My environment is pretty dynamic with pods being started and torn down quite often, but then again... that's what Kubernetes is for :)
Sorry to be another person in this thread with problems! I can use the cluster DNS server through its Pod IP, but not through its service IP.
I have deployed a new cluster this morning using ACS 0.14.1 and K8s 1.9.4
I have tried the reset HNS script above and have spun up multiple Pods as @JiangtianLi mentioned above.
Here are my results from nslookup inside the pods:
PS C:\> nslookup
DNS request timed out.
timeout was 2 seconds.
Default Server: UnKnown
Address: 10.0.0.10
> github.com
Server: UnKnown
Address: 10.0.0.10
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to UnKnown timed-out
> server 10.244.0.10
DNS request timed out.
timeout was 2 seconds.
Default Server: [10.244.0.10]
Address: 10.244.0.10
> github.com
Server: [10.244.0.10]
Address: 10.244.0.10
Non-authoritative answer:
Name: github.com
Addresses: 192.30.253.113
192.30.253.112
I get the same results from my Windows host.
I can reach my DNS service from my Linux master:
> server 10.0.0.10
Default server: 10.0.0.10
Address: 10.0.0.10#53
> google.co.uk
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: google.co.uk
Address: 172.217.17.35
@JiangtianLi DNS is still not working for me as well. Here is how I have setup my cluster:
Result: Internal DNS and external DNS are very random if/when it works.
I run the following script as part of my dockerfile to assess if things are working
Set-DnsClientGlobalSetting -SuffixSearchList @("${env:NAMESPACE}.svc.cluster.local", "svc.cluster.local")
Get-DnsClientGlobalSetting
nslookup www.bing.com
nslookup database
Test-NetConnection 10.0.0.10 -Port 53
ipconfig /all
try { $c = New-Object System.Net.Sockets.TcpClient("database", 1433); $c.close(); write-host 'Connected to database' } catch { write-host 'Failed to connect to database' }
try { $c = New-Object System.Net.Sockets.TcpClient("10.240.0.6", 1433); $c.close(); write-host 'Connected to 10.240.0.6' } catch { write-host 'Failed to connect to 10.240.0.6' }
The output of the script is essentially the same as others have reported. Requests time out or the application or script fails to connect to the database.
Thank you for all your help on this issue.
Can someone from the Microsoft/Azure team please comment on the recommendation of using microsoft/windowsservercore:1709_KB4074588
as the base image for containers? This keeps popping up and my understanding is that the current microsoft/windowsservercore:1709
tag has that KB patch so it shouldn't be necessary to use the KB-specific tag.
:1709 is sort of like latest: each time a new cumulative update is released, :1709 is re-tagged to point at the newest update.
1709_kb??????? is a durable tag if you want to manually choose when to update your container.
@PatrickLang - yes, understood. My question is whether or not the current 1709 tag has KB4074588 (released February 13, 2018) included. I'm assuming it does, since the latest build was 6 days ago and it was built ~1 month after KB4074588 was released. We're struggling with intermittent DNS/service IP issues on containers running the current 1709 tag, so I'm trying to make sure our environment is set up properly. Thanks.
You can also see the Windows revision numbers in docker history:
docker history microsoft/windowsservercore:1709
IMAGE CREATED CREATED BY SIZE COMMENT
6e8857bf419a 2 weeks ago Install update 10.0.16299.309 1.59GB
<missing> 5 months ago Apply image 10.0.16299.15 4.62GB
Those revisions are continually increasing, and all updates are cumulative. This page has a decoder ring to go from revision to KB: https://support.microsoft.com/en-us/help/4043454
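Since revisions increase monotonically and updates are cumulative, checking whether an image build contains a given update reduces to comparing the dotted build strings numerically. A small hypothetical helper (not part of the thread) to illustrate the idea:

```python
def parse_build(version):
    """Split a Windows build string such as '10.0.16299.309' into a
    tuple of ints so it compares numerically, not lexically."""
    return tuple(int(part) for part in version.split("."))

def includes_update(image_build, update_build):
    """Updates are cumulative and revisions only increase, so an image
    at or above an update's revision already contains that update."""
    return parse_build(image_build) >= parse_build(update_build)

# The docker history output above shows the current :1709 image at
# 10.0.16299.309; any update released at an earlier revision is included:
# includes_update("10.0.16299.309", "10.0.16299.248")  -> True
```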
Also, docker inspect windowsservercore:1709 has an OsVersion property.
@JiangtianLi I have done all that you suggest. After a redeploy + script + DaemonSet, the pods have internal and external communication. But after a while it suddenly stops and now there is no communication... If I run the hns script and redeploy my pods, all works fine again...
@JiangtianLi are you aware of this? Will the fix you're working on solve this too? ETA?
@qmalexander I've forwarded your issue to the team.
@qmalexander if possible, can you follow:
https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/startpacketcapture.cmd
<Try Test-Connection, until it fails>
https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/stoppacketcapture.cmd
And share c:\server.etl somewhere?
Is there an ETA for a resolution of these DNS issues? It's getting quite painful to do our current project. DNS or no DNS is like a lottery :(
@roycornelissen I moved ahead with my project by continuing to use the workaround for setting the dns addresses in the start up of my windows containers. Version 1.9 fixed the periodic external DNS issues caused by a race condition, but you still have to use the workaround. I just pass in both internal addresses to the pod as arguments. DNS is working consistently on the pods.
$adapter = Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses $dns1,$dns2
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
@JiangtianLi When I run the pod it will randomly come up with DNS, no network connectivity at all or no DNS and can still connect to other servers.
I am willing to get a capture for you. Would I run the startpacketcapture.cmd on one of the nodes and start a pod until it has a problem and then send server.etl to you?
Also, we are willing to give you access to our test cluster if that will help. Thanks.
@SteveCurran Thanks for your workaround. It seems to work 90% of the time whereas before it was working more like 10% of the time. Out of curiosity how many pods are you running? We are running around 30.
@zhech2 Yes, we want to capture the trace when the issue happens. The size of trace may be huge if you run for a long time though. Thank you for reporting the issue to help us investigate. I'll also loop back with the team.
@zhech2 I am just running two nodes.
@zhech2 First I would like to know the state of the system. Can you copy the folder https://github.com/Microsoft/SDN/tree/master/Kubernetes/windows/debug and execute "powershell collectlogs.ps1" on one of the problematic hosts and pass me the link to those files? You can also get a packet capture on that host by running a simple Test-Connection or nslookup inside the container when the POD is in a bad state.
I recently rebuilt a cluster and encountered this exact issue of not being able to connect to service IPs (and therefore no DNS) -- my current setup actually does work, although I haven't tested it with more than a few pods.
Here's what I did:
Is this a request for help?: NO
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm) kubernetes 1.8.6
What happened:
I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kube-dns) time out. Requests to external DNS servers work.
Remark: This issue is somehow related to #558 and #1949. The related issues suggest that the DNS problems are related to the Windows dnscache service or to the custom VNET feature. But the following description points in a different direction.
What you expected to happen: Requests to the internal DNS server should not time out.
Steps to reproduce:
Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:
Then run a Windows container. I used the following command:
kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell
Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:
Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.
I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod: