I am seeing this same issue myself. It was a complete roadblock for me until I recently came across a temporary mitigation: adding this to my Dockerfile:
RUN powershell Set-Service dnscache -StartupType disabled
RUN powershell Stop-Service dnscache
We do not recommend that you disable DNSCache. There are a few potential things at work here:
1) The Windows DNS Client caches both positive and negative responses to name resolution requests.
2) When container (or service) names are created (and scaled) through Docker, Docker Engine on Windows registers these names with their corresponding IP addresses in a DNS server owned by Docker. Containers therefore look first at the Docker Engine DNS server for name resolution and then at the other DNS servers configured on the container host (inherited in the container).
3) By default, the DNS client caches name-to-IP resolutions in the running container. If a particular container instance (part of the set of tasks which make up a service) goes down, the IP address is not immediately invalidated. The mapping is removed from the DNS server (owned by Docker Engine), but a client has to re-query the server to get the updated list of valid mappings.
4) Kubernetes has its own name-IP registration process with kube-dns.
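For reference, the cached entries can be inspected and flushed from inside a running container. This is a quick sketch (assuming the DnsClient PowerShell cmdlets are available in the base image, which they are not on Nano Server):
# From a PowerShell prompt inside the container:
Get-DnsClientCache      # show the cached positive and negative entries
Clear-DnsClientCache    # drop the cache so the next lookup re-queries the DNS server (same effect as ipconfig /flushdns)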
We recommend that users set MaxCacheTtl to 0 in the container to tell the DNS client not to cache results (or rather, to cache them for 0 seconds). We also recommend setting MaxNegativeCacheTtl to 0 so that "negative" hits (i.e. DNS resolution requests which returned no results) are not kept in the cache either, which forces the DNS client to retry every resolution even if previous attempts failed.
e.g. in the Dockerfile:
RUN powershell New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxCacheTtl -Value 0 -Type DWord
RUN powershell New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxNegativeCacheTtl -Value 0 -Type DWord
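To confirm the values actually landed inside a running container, a verification sketch (not part of the original guidance) that can be run in PowerShell:
# Read back the two TTL values written by the RUN commands above
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' | Select-Object MaxCacheTtl, MaxNegativeCacheTtl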
We have received reports that these two settings do not fix all problems. We have several known bugs which we are investigating and which are in various stages of being fixed.
Thanks for the insight @JMesser81. I have tried the registry key changes you mention and they do make things much better, but they don't solve the problem, whereas stopping the dnscache service does completely resolve it. Given the harsh warnings I've heard from almost everyone I've talked to, though, I would never go to production with that change, which is why this one is so frustrating: I have no real solution.
A little more color: I recently deployed an ACS cluster with Server 2016 v1709 Windows nodes, hoping it would magically solve the problems given all of the networking changes in it. Alas, it made no difference. In fact it felt like a step backwards, as neither stopping the dnscache service nor the registry keys solve the problem there. I think there may be some permissions issue with the v1709 version, as my Dockerfile commands don't seem to actually apply the changes (remoting into my container shows dnscache still running, for example).
So I continue to be stuck on this one.
Hmm... could you please try and capture logs/traces for us to analyze what's going on? From an elevated command prompt, please do the following:
• netsh trace start scenario=InternetClient_dbg capture=yes maxSize=1024
• netsh trace stop
• Share the *.cab/*.etl file which is generated with our team.
Please repro the problem in between the two netsh commands.
@JMesser81 I'm running my app on nanoserver, which doesn't seem to support netsh trace. Do you have an alternate option for nanoserver? Note I also don't have PowerShell, as I'm using v1709 nanoserver images.
And one further question: in order to pull files off a k8s container, it appears I need a supported version of tar on my local Windows machine. Can you advise me on how to get that working? I am seeing "invalid tar header" messages when I try to copy files locally today.
Can you try with a windows server container? Presumably, if this is a system issue it will repro equally on both nano and windows server core base images. If it does not repro with windows server core, it may be related directly to nano container base image but that will then help narrow our investigation.
Thanks
@JMesser81 hmm, easier said than done. My app is a .NET Core 2.0 app and there is no prebuilt windowsservercore container with .NET Core 2.x on it. I'm in the process of migrating the code backwards to .NET Framework 4.7.1 so I can easily run on windowsservercore.
In the meantime though I have attached to a servercore container and find that if I run this command as requested:
netsh trace start scenario=InternetClient_dbg capture=yes maxSize=1024
I get this error: 'InternetClient_dbg' is not a valid scenario. One or more parameters for the command are not correct or missing.
Oddly though in looking at the help for netsh, even the example they give fails similarly. Thoughts?
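One way to see which tracing scenarios a given image actually exposes (a suggestion, not from the original thread) is to list them first:
netsh trace show scenarios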
@JMesser81 I was able to get my app running in windowsservercore as a .NET Framework 4.7.1 app and repro'd the problem. The netsh trace commands you asked me to try didn't work, but I was able to run the logman commands below, which created an etl file. I was given these commands by another resource. Attached is the resulting nettrace.etl file. Please let me know if this is helpful.
logman -ets start -mode 2 -max 256 internettrace -p "{4E749B6A-667D-4c72-80EF-373EE3246B08}" 0x7FFFFFFF 5 -o nettrace.etl
logman -ets update internettrace -p "{43D1A55C-76D6-4f7e-995C-64C711E5CAFE}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{1A211EE8-52DB-4AF0-BB66-FB8C9F20B0E2}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{B3A7698A-0C45-44DA-B73D-E181C9B5C8E6}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{08F93B14-1608-4a72-9CFA-457EECEDBBA7}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{50b3e73c-9370-461d-bb9f-26f32d68887d}" 0xFFFFFFFFFFFFFFFF 5
logman -ets update internettrace -p "{988ce33b-dde5-44eb-9816-ee156b443ff1}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{41D92334-B49C-4938-85F1-3C22595DB157}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{609151DD-04F5-4DA7-974C-FC6947EAA323}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{1C95126E-7EEA-49A9-A3FE-A378B03DDB4D}" 0xFFFFFFFFFFFFFFFF 5
logman -ets update internettrace -p "{55404E71-4DB9-4DEB-A5F5-8F86E46DDE56}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{0C478C5B-0351-41B1-8C58-4A6737DA32E3}" 0x7FFFFFFF 5
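For completeness (this stop command is assumed, not part of the original comment): the ETW session started above is stopped to flush the trace to nettrace.etl before copying it off the container:
logman -ets stop internettrace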
Also having this issue. You'd think the documentation would include, in big red flashing text, "WINDOWS CONTAINERS CAN'T COMMUNICATE OUTSIDE OF THE CLUSTER". Even a brief mention of it in the known limitations would suffice.
+1 I've spent days deploying different clusters, trying different windowsservercore:1709 base images and digging this repo issues. Nothing helped. I'd rather concentrate on something else and wait for the issue to be resolved...
I'm able to curl IPs that exist outside of the cluster, but I cannot curl the DNS names of servers within the same vnet. I'm going to try with a 1.9.1 cluster.
To circle back around on this I now have a Windows v1803 cluster deployed using ACS-Engine v0.20.9 and am no longer seeing any DNS issues with deployed pods. This issue can be closed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.
Is this a request for help?: Yes
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: 0.8.0
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.7.7
What happened: Deployed services are having regular issues resolving external DNS records for certain services they reach out to (one in particular is a Cosmos DB instance). When this occurs, I can access the container, do a DNS cache flush, and then the record will be resolvable.
What you expected to happen: The DNS record should be resolvable at all times.
How to reproduce it (as minimally and precisely as possible): Simply wait for the service to throw an error which typically happens within 30 minutes or less of the last DNS cache flush.
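As an illustration only (the hostname below is a placeholder, not from the report), a loop like the following run inside the affected container tends to surface the failure without waiting for the application to hit it:
# Resolve an external name once a minute and log when the DNS client starts failing
while ($true) {
    try { Resolve-DnsName 'example-account.documents.azure.com' -ErrorAction Stop | Out-Null; Write-Host "$(Get-Date -Format o) OK" }
    catch { Write-Host "$(Get-Date -Format o) FAILED: $($_.Exception.Message)" }
    Start-Sleep -Seconds 60
}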
Anything else we need to know: