Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine

External DNS resolution issues #1680

Closed: joeyea323 closed this issue 5 years ago

joeyea323 commented 7 years ago

Is this a request for help?: Yes

Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE

What version of acs-engine?: 0.8.0

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.7.7

What happened: Deployed services are having regular issues resolving external DNS records for certain services they reach out to (one in particular is a Cosmos DB instance). When this occurs, I can access the container, do a DNS cache flush, and then the record will be resolvable.

What you expected to happen: The DNS record should be resolvable at all times.

How to reproduce it (as minimally and precisely as possible): Simply wait for the service to throw an error, which typically happens within 30 minutes of the last DNS cache flush (a minimal repro watcher is sketched below).

Anything else we need to know:
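(As a reproduction aid, a minimal watcher along these lines can timestamp exactly when resolution starts failing. This is a sketch, assuming Windows PowerShell is available in the container; the hostname is a placeholder for whatever external record the service depends on.)

```powershell
# Hedged sketch: poll an external name once a minute and log when resolution fails.
# The hostname below is a placeholder; substitute the record your service actually uses.
while ($true) {
    try {
        Resolve-DnsName -Name 'example.documents.azure.com' -ErrorAction Stop | Out-Null
        Write-Output "$(Get-Date -Format o) OK"
    }
    catch {
        Write-Output "$(Get-Date -Format o) FAILED: $($_.Exception.Message)"
    }
    Start-Sleep -Seconds 60
}
```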

brobichaud commented 7 years ago

I am seeing this same issue myself. Complete roadblock for me until I recently came across a temporary mitigation of adding this to my dockerfile:

```dockerfile
RUN powershell Set-Service dnscache -StartupType disabled
RUN powershell Stop-Service dnscache
```
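(If you go this route, it is worth verifying inside the running container that the change actually took effect; a quick check along these lines, assuming the image still ships Windows PowerShell:)

```powershell
# Inside the running container: confirm the DNS Client service really is stopped and disabled.
Get-Service -Name dnscache | Select-Object Name, Status, StartType
```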

JMesser81 commented 6 years ago

We do not recommend that you disable DNSCache. There are a few potential things at work here:

1. The Windows DNS client caches both positive and negative responses to name resolution requests.
2. When container (or service) names are created (and scaled) through Docker, the Docker Engine on Windows registers these names with corresponding IP addresses in a DNS server owned by Docker. Containers therefore first look at the Docker Engine DNS server for name resolution and then look at other DNS servers configured on the container host (inherited in the container).
3. By default, the DNS client will cache name-IP resolutions in the running container. If a particular container instance (part of the set of tasks which make up a service) goes down, the IP address is not immediately invalidated. The mapping is removed from the DNS server (owned by the Docker engine), but a client will have to re-query the server to get the updated list of valid mappings.
4. Kubernetes has its own name-IP registration process with kube-dns.
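(A side note on diagnosing point 1: the client-side cache can be inspected and flushed from inside a Windows container, which is a quick way to confirm that a stale entry is the culprit. A sketch, assuming Windows PowerShell in the image; the name filter is a placeholder:)

```powershell
# Look for the problem record in the local DNS client cache
# (the name filter is a placeholder):
Get-DnsClientCache | Where-Object Entry -like '*documents.azure.com*'

# Flush the cache; if resolution immediately succeeds afterwards, a stale
# cached entry was the culprit. (ipconfig /flushdns does the same thing.)
Clear-DnsClientCache
```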

We recommend that users set MaxCacheTtl to 0 in the container to tell the DNS client not to cache results (or rather, to cache them for 0 seconds). We also recommend setting MaxNegativeCacheTtl to 0 so that "negative" hits (i.e. DNS resolution requests which returned no results) are not kept in cache either, instructing the DNS client to always retry resolutions even if previous ones failed.

e.g. in a Dockerfile:

```dockerfile
RUN powershell New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxCacheTtl -Value 0 -Type DWord
RUN powershell New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxNegativeCacheTtl -Value 0 -Type DWord
```
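(A hedged verification sketch, relevant to the later report that these commands did not seem to apply on v1709: New-ItemProperty fails if the value already exists, while -Force overwrites it. Run inside the container, assuming PowerShell is present:)

```powershell
# Confirm the TTL overrides actually landed:
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' |
    Select-Object MaxCacheTtl, MaxNegativeCacheTtl

# If the Dockerfile RUN step failed because the value already existed,
# re-apply with -Force to overwrite:
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' `
    -Name MaxCacheTtl -Value 0 -Type DWord -Force | Out-Null
```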

We have received reports that these two settings do not fix all problems. We have several known bugs which we are investigating and which are in various stages of being fixed.

brobichaud commented 6 years ago

Thanks for the insight @JMesser81. I have tried the registry key changes you mention and they do make things much better, but they don't solve the problem, whereas stopping the dnscache service does completely resolve it. Given the harsh warnings I've heard from almost everyone I've talked to, though, I would never go to production with that change, which is why this one is so frustrating: I have no real solution.

A little more color: I recently deployed an ACS cluster with Server 2016 v1709 Windows nodes, hoping it would magically solve the problems given all of the networking changes in that release. Alas, it made no difference. In fact it felt like a step backwards, as neither stopping the dnscache service nor the registry keys solve the problem there. I think there may be a permissions issue with the v1709 version, as my dockerfile commands don't seem to actually apply the changes (remoting into my container shows the dnscache service still running, for example).

So I continue to be stuck on this one.

JMesser81 commented 6 years ago

Hmm... could you please try and capture logs/traces for us to analyze what's going on? From an elevated command prompt, please do the following:

• netsh trace start scenario=InternetClient_dbg capture=yes maxSize=1024
• reproduce the problem
• netsh trace stop

Then share the *.cab/*.etl file which is generated with our team.

JMesser81 commented 6 years ago

Please repro the problem in between the two netsh commands.
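(Putting those steps together, the capture session looks roughly like this from an elevated prompt inside the container; note that InternetClient_dbg availability appears to vary by SKU, as reported later in this thread.)

```powershell
# Start the capture, reproduce the failure, then stop the capture.
netsh trace start scenario=InternetClient_dbg capture=yes maxSize=1024

# ... reproduce the DNS failure here ...

netsh trace stop   # writes the .etl/.cab output (by default under %LOCALAPPDATA%\Temp\NetTraces)
```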

brobichaud commented 6 years ago

@JMesser81 I'm running my app on a nanoserver image, which does not seem to support netsh trace. Do you have an alternate option for nanoserver? Note I also don't have PowerShell, as I'm using v1709 nanoserver images.

And one further question: in order to pull files off a k8s container, it appears I need a supported version of tar on my local Windows machine. Can you advise me on how to get that working? I am seeing "invalid tar header" messages when I try to copy files locally today.
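(One possible workaround, not suggested in this thread, if kubectl cp's tar handling is the blocker: stream the file out as base64 through kubectl exec and decode it locally, avoiding tar entirely. The pod name and path below are placeholders.)

```powershell
# Emit the trace file as base64 from the pod (pod name and path are placeholders)...
kubectl exec my-windows-pod -- powershell -Command `
    "[Convert]::ToBase64String([IO.File]::ReadAllBytes('C:\nettrace.etl'))" |
    Set-Content trace.b64

# ...then rebuild the binary file on the local machine.
[IO.File]::WriteAllBytes("$PWD\nettrace.etl",
    [Convert]::FromBase64String((Get-Content trace.b64 -Raw)))
```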

JMesser81 commented 6 years ago

Can you try with a Windows Server Core container? Presumably, if this is a system issue it will repro equally on both the Nano Server and Windows Server Core base images. If it does not repro with Windows Server Core, it may be related directly to the Nano Server container base image, but even that will help narrow our investigation.

Thanks

brobichaud commented 6 years ago

@JMesser81 hmm, easier said than done. My app is a DotNet Core 2.0 app and there is no prebuilt windowsservercore container with DotNet Core 2.x on it. I'm in the process of migrating the code backwards to DotNet 4.7.1 so I can easily run on windowsservercore.

In the meantime though I have attached to a servercore container and find that if I run this command as requested:

netsh trace start scenario=InternetClient_dbg capture=yes maxSize=1024

I get this error: 'InternetClient_dbg' is not a valid scenario. One or more parameters for the command are not correct or missing.

Oddly, though, in looking at the help for netsh, even the example it gives fails similarly. Thoughts?

brobichaud commented 6 years ago

@JMesser81 I was able to get my app running in windowsservercore as a DotNet 4.7.1 app and repro'd the problem. The netsh trace commands you asked me to try didn't work, but I was able to run the logman commands below, which created an etl file. I was given these commands by another resource. Attached is the resulting nettrace.etl file. Please let me know if this is helpful.

```
logman -ets start -mode 2 -max 256 internettrace -p "{4E749B6A-667D-4c72-80EF-373EE3246B08}" 0x7FFFFFFF 5 -o nettrace.etl
logman -ets update internettrace -p "{43D1A55C-76D6-4f7e-995C-64C711E5CAFE}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{1A211EE8-52DB-4AF0-BB66-FB8C9F20B0E2}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{B3A7698A-0C45-44DA-B73D-E181C9B5C8E6}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{08F93B14-1608-4a72-9CFA-457EECEDBBA7}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{50b3e73c-9370-461d-bb9f-26f32d68887d}" 0xFFFFFFFFFFFFFFFF 5
logman -ets update internettrace -p "{988ce33b-dde5-44eb-9816-ee156b443ff1}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{41D92334-B49C-4938-85F1-3C22595DB157}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{609151DD-04F5-4DA7-974C-FC6947EAA323}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{1C95126E-7EEA-49A9-A3FE-A378B03DDB4D}" 0xFFFFFFFFFFFFFFFF 5
logman -ets update internettrace -p "{55404E71-4DB9-4DEB-A5F5-8F86E46DDE56}" 0x7FFFFFFF 5
logman -ets update internettrace -p "{0C478C5B-0351-41B1-8C58-4A6737DA32E3}" 0x7FFFFFFF 5
```
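(For completeness: the commands above start and configure an ETW session but never end it. The session is stopped, and the .etl file finalized, with the matching stop command, which the list above omits.)

```powershell
# Stop the ETW session and flush the .etl file to disk.
logman -ets stop internettrace
```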

nettrace.zip

patrick-motard commented 6 years ago

Also having this issue. You'd think the documentation would include, in big red flashing text, "WINDOWS CONTAINERS CAN'T COMMUNICATE OUTSIDE OF THE CLUSTER". Even a brief mention of it in the known limitations would suffice.

4ux-nbIx commented 6 years ago

+1. I've spent days deploying different clusters, trying different windowsservercore:1709 base images, and digging through this repo's issues. Nothing helped. I'd rather concentrate on something else and wait for the issue to be resolved...

patrick-motard commented 6 years ago

I'm able to curl IPs that exist outside of the cluster, but I cannot curl DNS names of servers within the same vnet. I'm going to try with a 1.9.1 cluster.
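(That symptom, IPs reachable but names not resolving, can be confirmed from inside the container with something like the following; the address and name are placeholders.)

```powershell
# Reaching an endpoint by raw IP works (placeholder address)...
curl.exe http://10.240.0.5/

# ...while resolving a name in the vnet fails; nslookup also reports which
# DNS server actually answered, which helps pin the failure on kube-dns vs. upstream.
Resolve-DnsName myserver.internal.example.com
nslookup myserver.internal.example.com
```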

brobichaud commented 6 years ago

To circle back around on this: I now have a Windows v1803 cluster deployed using acs-engine v0.20.9 and am no longer seeing any DNS issues with deployed pods. This issue can be closed.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.