Closed chweidling closed 5 years ago
@jiangtianki thanks. We are not using acs-engine but encountered this bug nonetheless so we appreciate the upstream/public fixes.
I can successfully create a windows container using microsoft/windowsservercore:1709_KB4074588
hm, are there dotnet and aspnet images with that fix?
@4c74356b41 maybe you can try "docker pull" to refresh the base image before building your windows container.
@yuedai wouldn't help unless they updated the images with this fix
@JiangtianLi is this working already? can we recreate the cluster? thanks!
@msorby where have you got that image from? are they being published somewhere, can you provide a link? thanks.
@4c74356b41 The Feb Windows update/docker image is already out, so it should fix the DNS configuration issue. I will update here after I confirm in a Windows cluster on my side.
@JiangtianLi do you know if/when MS releases the new image for Windows hosts (in Azure)? I checked today and the latest image for 1709 was from December.
@4c74356b41 using acs-engine 0.13.0, it has the hotfix for the windows host. Then I use this microsoft/windowsservercore:1709_KB4074588 docker image for my core container.
So no need to look for images in Azure; acs-engine 0.13.0 patches the host.
I also was able to get internal DNS working using acs-engine 0.13.0 and k8s 1.8.4. But I'm not able to get external DNS working -_-
@patrick-motard what DNS server is the container using? What is the output of ipconfig /all inside the container?
@JiangtianLi
PS C:\> ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : my-app-798c67b4db-gm2tn
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
Ethernet adapter vEthernet (5376f59639304101ffa730ac1d398c1b34f83c602036910eb1257c957800ab24_l2bridge):
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #4
Physical Address. . . . . . . . . : 00-15-5D-06-98-42
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::8510:ea5e:29f6:efe7%25(Preferred)
IPv4 Address. . . . . . . . . . . : 10.244.4.241(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 10.240.0.1
DNS Servers . . . . . . . . . . . : 10.0.0.10
NetBIOS over Tcpip. . . . . . . . : Disabled
I have a Windows server within the same subnet called "my-sql-server". From the Windows node I can curl the SQL server using curl my-sql-server and get a 200 response. From inside the container on the node I cannot.
Got it to work today, using acs-engine (commit ba48383a, I build it daily here). What got me really confused is that ping doesn't work from within the container, but Invoke-Webrequest does:
PS C:\> Invoke-webrequest -UseBasicParsing https://google.com
StatusCode : 200
StatusDescription : OK
Notably, I'm pulling microsoft/windowsservercore:1709 image.
Hi,
I've just deployed a cluster with acs-engine 0.13.1 and using microsoft/windowsservercore:1709_KB4074588 as base image for my containers but external dns resolution doesn't work.
IpConfig /all result is the same as @patrick-motard
I've installed all windows updates on win node.
@patrick-motard @cpunella what is the output of resolve-dnsname www.bing.com inside the container?
PS C:\> resolve-dnsname www.bing.com
Name Type TTL Section NameHost
---- ---- --- ------- --------
www.bing.com CNAME 60 Answer www-bing-com.a-0001.a-msedge.net
www-bing-com.a-0001.a-msedge.n CNAME 60 Answer a-0001.dc-msedge.net
et
Name : a-0001.dc-msedge.net
QueryType : A
TTL : 60
Section : Answer
IP4Address : 131.253.33.200
Name : a-0001.dc-msedge.net
QueryType : A
TTL : 60
Section : Answer
IP4Address : 13.107.22.200
PS C:\> Invoke-WebRequest -UseBasicParsing https://google.com
StatusCode : 200
StatusDescription : OK
Content : <!doctype html><html itemscope=""
itemtype="http://schema.org/WebPage" lang="en"><head><meta
content="Search the world's information, including webpages,
images, videos and more. Google has many speci...
RawContent : HTTP/1.1 200 OK
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Cache-Control: private, max-age=0
Content-Type: text/html; charset=UTF-8
Date: Fri, 02 Mar 2018 17:35:53 GMT
Expires: ...
Forms :
Headers : {[X-XSS-Protection, 1; mode=block], [X-Frame-Options, SAMEORIGIN],
[Cache-Control, private, max-age=0], [Content-Type, text/html;
charset=UTF-8]...}
Images : {@{outerHTML=<img alt="Holi 2018" border="0" height="220"
src="/logos/doodles/2018/holi-2018-5209035568578560-l.png"
title="Holi 2018" width="550" id="hplogo"
onload="window.lol&&lol()">; tagName=IMG; alt=Holi 2018; border=0;
height=220;
src=/logos/doodles/2018/holi-2018-5209035568578560-l.png;
title=Holi 2018; width=550; id=hplogo; onload=window.lol&&lol()}}
InputFields : {}
Links : {@{outerHTML=<a onclick=gbar.logger.il(1,{t:1}); class="gbzt gbz0l
gbp1" id=gb_1 href="https://www.google.com/webhp?tab=ww"><span
class=gbtb2></span><span class=gbts>Search</span></a>; tagName=A;
onclick=gbar.logger.il(1,{t:1});; class=gbzt gbz0l gbp1; id=gb_1;
href=https://www.google.com/webhp?tab=ww}, @{outerHTML=<a
onclick=gbar.logger.il(1,{t:2}); class=gbzt id=gb_2
href="https://www.google.com/imghp?hl=en&tab=wi"><span
class=gbtb2></span><span class=gbts>Images</span></a>; tagName=A;
onclick=gbar.logger.il(1,{t:2});; class=gbzt; id=gb_2;
href=https://www.google.com/imghp?hl=en&tab=wi}, @{outerHTML=<a
onclick=gbar.logger.il(1,{t:8}); class=gbzt id=gb_8
href="https://maps.google.com/maps?hl=en&tab=wl"><span
class=gbtb2></span><span class=gbts>Maps</span></a>; tagName=A;
onclick=gbar.logger.il(1,{t:8});; class=gbzt; id=gb_8;
href=https://maps.google.com/maps?hl=en&tab=wl}, @{outerHTML=<a
onclick=gbar.logger.il(1,{t:78}); class=gbzt id=gb_78
href="https://play.google.com/?hl=en&tab=w8"><span
class=gbtb2></span><span class=gbts>Play</span></a>; tagName=A;
onclick=gbar.logger.il(1,{t:78});; class=gbzt; id=gb_78;
href=https://play.google.com/?hl=en&tab=w8}...}
ParsedHtml :
RawContentLength : 47295
@JiangtianLi I can confirm it works for me, but if I don't add start-sleep 5 to my init script, it sometimes crashes.
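If the crash comes from the network not being ready when the init script starts (an assumption; the thread does not confirm the cause), polling until a name resolves may be more robust than a fixed sleep. A minimal sketch in Python, assuming a Python runtime is available in the image; the function name wait_for_dns, the attempt count and the delay are illustrative choices, not part of acs-engine or Kubernetes:

```python
import socket
import time


def wait_for_dns(name, attempts=10, delay=1.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Poll until `name` resolves; return the address, or None if every attempt fails.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without real network access (hypothetical parameters for this sketch).
    """
    for attempt in range(attempts):
        try:
            return resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the name did not resolve yet.
            if attempt < attempts - 1:
                sleep(delay)
    return None
```

An init script could call wait_for_dns("www.bing.com") and only start the application once it returns an address, failing loudly if it returns None.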
Both google and bing work. Can't hit a server in the same vnet though. I have a server called 'my-server'. I can curl it and get a 200 back from the node itself but not from the container on the node.
PS C:\> curl -UseBasicParsing my-server
curl : The remote name could not be resolved: 'my-server'
At line:1 char:1
+ curl -UseBasicParsing my-server
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebReque
st) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Comman
ds.InvokeWebRequestCommand
Okay, I'm a little confused. After some testing I am seeing something different from what I thought I saw the other day. I cannot curl the server in the vnet using the server's name from Linux containers, Linux nodes, or Windows nodes. I could have sworn I could use both the name and the IP the other day from all of those locations except the Windows container. I'm not going to be able to look at this again until Monday. I'll try to reproduce all of this again and give a more detailed explanation then.
@JiangtianLi this is the output
PS C:\app> resolve-dnsname www.bing.com
resolve-dnsname : www.bing.com : This operation returned because the timeout period expired
At line:1 char:1
I'm seeing really weird behavior after updating to the latest acs-engine and k8s 1.9.3: some of the pods consistently fail to resolve DNS, while others work.
upd: its gone on its own after 3 hours. no idea.
I take it back. networking is extremely unreliable at startup. its just unreliable. no conditions.
Ok, more findings, acs 0.13.1 doesnt install 2018-02 Cumulative Update for Windows 10 Version 1709 for x64-based Systems (KB4074588) to the windows nodes. is this expected? after installing that update and rebooting internet is gone :)
Is there a way to get the internal traffic to work? I run: acs-engine v0.13.1 and Kubernetes 1.9.1. The external traffic works.
Right, I thought that I had it working: acs-engine 0.13.0 and k8s 1.9.3, using microsoft/windowsservercore:1709_KB4074588 as the base image for my container. But I'm experiencing the same as @4c74356b41; it's just not reliable. It was working for a bit, but then it stopped, and after that it's a no-go. This is for external resources.
@msorby my containers lose internet after node reboot :) tested on 3 clusters built from scratch ;)
@JiangtianLi any ideas? :blush:
@4c74356b41 Regarding issue with reboot, there is a PR to fix it: https://github.com/Azure/acs-engine/pull/2378. acs-engine doesn't choose windows version, it always uses the latest from Azure.
@qmalexander For internal traffic, do kube-dns and nslookup kubernetes work on your end? Does internal traffic work on the Linux node?
@madhanrm In case those issues are known from Windows team.
@JiangtianLi so should I install that kb on agent nodes or not? I thought you said the feb update is required for cluster endpoints to work?
@4c74356b41 You should not need to install anything manually. acs-engine already applies the patch package in case the Feb update is not out. With the Feb update, there is no action needed from you either. What is the version on your Windows node? What is the output of the following:
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v BuildLab
Get-HotFix
@JiangtianLi
16299.rs3_release.170928-1534
KB123456
KB4087256
KB999999
KB4056892
the 4056892 is listed as installed by me. 4087256 as system, but 3 days after cluster provision
do you know if it is possible to just delete windows nodes only and redeploy same definition into the same resource group (but built from the pr you mentioned)? or will it crap out? Also, should I use cni or not? I'm asking in terms of stability only. which is more stable at the moment? because for me both do not really work (without that PR at least) :(
@4c74356b41 I lost external dns resolution without node rebooting. But once it's gone it's gone for all pods created.
@4c74356b41 Azure CNI is not the default networkPolicy and it is still in beta. If you have any issue with Azure CNI, please report it with detailed repro steps and I will loop in the Azure Networking folks.
@msorby What is the output of ipconfig /all in your container? Can you reach kube-dns from the container? Is it only external names (www.bing.com) that don't work, or internal names (kubernetes or other k8s services) too?
Ethernet adapter vEthernet (be1b1dcbfdb5d5238c2680576bdd9d30864cb7e20f639310695879f2b4138d51_l2bridge):
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
Physical Address. . . . . . . . . : 00-15-5D-D4-9E-2D
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::462:a0ff:77d2:68a0%21(Preferred)
IPv4 Address. . . . . . . . . . . : 10.244.5.168(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 10.240.0.1
DNS Servers . . . . . . . . . . . : 10.0.0.10
NetBIOS over Tcpip. . . . . . . . : Disabled
I cant resolve anything.
PS C:\> telnet 10.0.0.10 53
Connecting To 10.0.0.10... Could not open connection to the host, on port 53: Connect failed
Get-HNSNetwork | ? name -eq 'l2bridge' | Remove-HNSNetwork
does not help.
regarding my previous post. does that look ok? or is something missing from the node?
@4c74356b41 It appears your kube-dns is not reachable. Can you get kube-dns's status and logs? Does DNS query work from linux node?
@JiangtianLi yes, resolution works from linux node\containers. this is from linux container:
sh-4.2# getent hosts ya.ru
2a02:6b8::2:242 ya.ru
204.79.197.200 bing.com
13.107.21.200 bing.com
sh-4.2# getent hosts google.tt
2a00:1450:4009:80b::2003 google.tt
how to get kube-dns status? here's the logs from kubedns:
I0302 07:54:36.261100 1 dns.go:173] Waiting for services and endpoints to be initialized from apiserver...
I0302 07:54:36.736185 1 dns.go:170] Initialized services and endpoints from apiserver
I0302 07:54:36.736197 1 server.go:135] Setting up Healthz Handler (/readiness)
I0302 07:54:36.736203 1 server.go:140] Setting up cache handler (/cache)
I0302 07:54:36.736210 1 server.go:126] Status HTTP port 8081
I0305 09:07:05.154478 1 logs.go:41] skydns: failure to forward request "read udp 10.244.0.5:54747->168.63.129.16:53: i/o timeout"
I0305 09:07:05.155134 1 logs.go:41] skydns: failure to forward request "read udp 10.244.0.5:54747->168.63.129.16:53: i/o timeout"
nothing valuable before that. other one:
I0302 07:54:26.301287 1 dns.go:146] Starting endpointsController
I0302 07:54:26.301291 1 dns.go:149] Starting serviceController
I0302 07:54:26.301407 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0302 07:54:26.301415 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0302 07:54:26.804931 1 dns.go:173] Waiting for services and endpoints to be initialized from apiserver...
XXXX redacted XXXX
I0302 07:54:34.804100 1 dns.go:170] Initialized services and endpoints from apiserver
I0302 07:54:34.804116 1 server.go:135] Setting up Healthz Handler (/readiness)
I0302 07:54:34.804123 1 server.go:140] Setting up cache handler (/cache)
I0302 07:54:34.804132 1 server.go:126] Status HTTP port 8081
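The only errors in the kube-dns logs above are the "failure to forward request" lines, which show timeouts when forwarding to 168.63.129.16 (the Azure-provided DNS address). A small sketch for tallying such lines in a longer log, assuming the log text follows the format shown in this thread; the helper name count_forward_failures is hypothetical:

```python
import re
from collections import Counter

# Matches the skydns forwarding failures exactly as they appear in the log above.
FORWARD_FAILURE = re.compile(
    r'skydns: failure to forward request "read udp '
    r'(?P<src>[\d.]+):\d+->(?P<upstream>[\d.]+):(?P<port>\d+): (?P<reason>[^"]+)"'
)


def count_forward_failures(log_text):
    """Count forwarding failures per (upstream, reason) in kube-dns log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FORWARD_FAILURE.search(line)
        if match:
            counts[(match["upstream"], match["reason"])] += 1
    return counts
```

The log text could come from kubectl logs for the kubedns container; a high count against a single upstream would point at upstream forwarding rather than at cluster-internal resolution.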
@4c74356b41 kubectl get po -n kube-system -o wide
@JiangtianLi ah, that status. all running.
kube-dns-v20-597689868c-ftpcv 3/3 Running 0 4d 10.244.0.6 k8s-linpul-39524942-0
kube-dns-v20-597689868c-m7vps 3/3 Running 0 4d 10.244.0.5 k8s-linpul-39524942-0
@4c74356b41 Also, what is the output of kubectl get no -o wide?
why is kube dns not on master nodes? i would assume it belongs there. but i dont really know k8s all that good :(
NAME                    STATUS     ROLES    AGE   VERSION   EXTERNAL-IP     OS-IMAGE                       KERNEL-VERSION      CONTAINER-RUNTIME
39524k8s9000            Ready      <none>   4d    v1.9.3    51.141.90.143   Windows Server Datacenter      10.0.16299.192      docker://17.6.2
39524k8s9001            NotReady   <none>   4d    v1.9.3    <none>          Windows Server Datacenter      10.0.16299.192      docker://17.6.2
k8s-linpul-39524942-0   Ready      agent    4d    v1.9.3    <none>          Debian GNU/Linux 9 (stretch)   4.13.0-1007-azure   docker://1.13.1
k8s-master-39524942-0   Ready      master   4d    v1.9.3    <none>          Debian GNU/Linux 9 (stretch)   4.13.0-1007-azure   docker://1.13.1
k8s-master-39524942-1   Ready      master   4d    v1.9.3    <none>          Debian GNU/Linux 9 (stretch)   4.13.0-1007-azure   docker://1.13.1
k8s-master-39524942-2   Ready      master   4d    v1.9.3    <none>          Debian GNU/Linux 9 (stretch)   4.13.0-1007-azure   docker://1.13.1
one windows node is shutdown by me to save costs (since its not working anyway).
@4c74356b41 kube-dns is an add-on pod and can be scheduled on an agent node. Can you share the output of the following in your container?
Test-NetConnection 10.0.0.10 -port 53
Test-NetConnection 10.244.0.5 -port 53
Resolve-DnsName www.bing.com
Also on windows node:
Test-NetConnection 10.244.0.5 -port 53
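The TCP part of these checks can also be scripted, which helps when comparing several pods or nodes. A minimal Python sketch, assuming Python is available where the check runs; the helper names tcp_reachable and check_targets are made up for illustration. It only tests TCP port 53, like Test-NetConnection does, so a success does not prove that UDP DNS queries get through:

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_targets(targets, probe=tcp_reachable):
    """Probe each (label, host, port) tuple and return {label: bool}."""
    return {label: probe(host, port) for label, host, port in targets}
```

With the addresses from this thread, the targets would be ("kube-dns service", "10.0.0.10", 53) and ("kube-dns pod", "10.244.0.5", 53). Running the same list from the container and from the node shows at which hop connectivity is lost.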
agent node:
PS C:\> Test-NetConnection 10.244.0.5 -port 53
ComputerName : 10.244.0.5
RemoteAddress : 10.244.0.5
RemotePort : 53
InterfaceAlias : vEthernet (Ethernet 2)
SourceAddress : 10.240.0.4
TcpTestSucceeded : True
container:
PS C:\> Test-NetConnection 10.0.0.10 -port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut
ComputerName : 10.0.0.10
RemoteAddress : 10.0.0.10
RemotePort : 53
InterfaceAlias : vEthernet (be1b1dcbfdb5d5238c2680576bdd9d30864cb7e20f639310695879f2b4138d51_l2bridge)
SourceAddress : 10.244.5.168
PingSucceeded : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded : False
PS C:\> resolve-dnsname bing.com
resolve-dnsname : bing.com : This operation returned because the timeout period expired
At line:1 char:1
+ resolve-dnsname bing.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationTimeout: (bing.com:String) [Resolve-DnsName], Win32Exception
+ FullyQualifiedErrorId : ERROR_TIMEOUT,Microsoft.DnsClient.Commands.ResolveDnsName
PS C:\> Test-NetConnection 10.244.0.5 -port 53
WARNING: TCP connect to (10.244.0.5 : 53) failed
WARNING: Ping to 10.244.0.5 failed with status: TimedOut
ComputerName : 10.244.0.5
RemoteAddress : 10.244.0.5
RemotePort : 53
InterfaceAlias : vEthernet (be1b1dcbfdb5d5238c2680576bdd9d30864cb7e20f639310695879f2b4138d51_l2bridge)
SourceAddress : 10.244.5.168
PingSucceeded : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded : False
@4c74356b41 Sorry, updated the commands. Can you retry?
@4c74356b41 Can you use https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1 to run
Get-HnsEndpoints | ConvertTo-Json -depth 10
Get-HnsPolicyLists | ConvertTo-Json -depth 10
on windows node? Is kube-proxy running on windows node?
sc query kubeproxy
endpoints: https://paste.ee/p/gqMaE policy lists: https://paste.ee/p/rUJFn
PS C:\> get-service kube*
Status Name DisplayName
------ ---- -----------
Running Kubelet Kubelet
Running Kubeproxy Kubeproxy
but sc query kubeproxy returns nothing
I'm also seeing the same issue as @4c74356b41 on my newly provisioned hybrid cluster with windows containers only. No internal or external dns resolution. I have similar outputs from @JiangtianLi 's command as @4c74356b41.
Is this a request for help?: NO
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: canary, GitCommit 8fd4ac4267c29370091d98d80c3046bed517dd8c
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm) kubernetes 1.8.6
What happened:
I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server kubedns time out. Requests to external DNS servers work.
Remark: This issue is somehow related to #558 and #1949. The related issues suggest that the DNS problems have a relation to the Windows dnscache service or to the custom VNET feature. But the following description points to a different direction.
What you expected to happen: Requests to the internal DNS server should not time out.
Steps to reproduce:
Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:
Then run a Windows container. I used the following command:
kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell
Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:
Anything else we need to know: As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.
I observed the behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod: