TechnitiumSoftware / DnsServer

Technitium DNS Server
https://technitium.com/dns/
GNU General Public License v3.0

Problem accessing some sites (resolving issue) #199

Closed EHRETic closed 3 years ago

EHRETic commented 3 years ago

Hi there,

I'm a little embarrassed because I couldn't really figure out what exactly the issue is, except that when I empty the cache, it works instantly again. This morning, I had a problem accessing calendar.google.com (again, not the first time). Once I flushed the cache, it worked straight away. It is a bit annoying, especially as my wife uses it a lot for her business.

Here are today's logs. The two queries at 09:09 are mine; the site did not load, even though the log shows them as resolved:

[2020-11-27 07:12:55 UTC] [XXX.XXX.XXX.XXX:55635] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.214.78]
[2020-11-27 07:20:02 UTC] [XXX.XXX.XXX.XXX:53307] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.206.238]
[2020-11-27 07:32:21 UTC] [XXX.XXX.XXX.XXX:50103] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.204.142]
[2020-11-27 07:42:21 UTC] [XXX.XXX.XXX.XXX:65037] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [172.217.18.206]
[2020-11-27 07:43:16 UTC] [XXX.XXX.XXX.XXX:51977] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.204.142]
[2020-11-27 07:52:21 UTC] [XXX.XXX.XXX.XXX:56738] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.142]
[2020-11-27 08:02:21 UTC] [XXX.XXX.XXX.XXX:57354] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [142.250.74.238]
[2020-11-27 08:03:27 UTC] [XXX.XXX.XXX.XXX:49897] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.201.238]
[2020-11-27 08:12:21 UTC] [XXX.XXX.XXX.XXX:57396] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.174]
[2020-11-27 08:22:21 UTC] [XXX.XXX.XXX.XXX:64699] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.206.238]
[2020-11-27 08:43:31 UTC] [XXX.XXX.XXX.XXX:65438] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.204.142]
[2020-11-27 08:52:21 UTC] [XXX.XXX.XXX.XXX:51882] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.142]
[2020-11-27 09:02:21 UTC] [XXX.XXX.XXX.XXX:50030] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [172.217.18.206]
[2020-11-27 09:09:28 UTC] [XXX.XXX.XXX.XXX:49604] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.215.46]
[2020-11-27 09:09:44 UTC] [XXX.XXX.XXX.XXX:55172] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.215.46]

[2020-11-27 09:10:17 UTC] [XXX.XXX.XXX.XXX:0] [admin] Cache was flushed.

[2020-11-27 09:10:41 UTC] [XXX.XXX.XXX.XXX:62843] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.78]
[2020-11-27 09:10:44 UTC] [XXX.XXX.XXX.XXX:49958] [UDP] QNAME: calendar-pa.clients6.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.204.138]
[2020-11-27 09:10:45 UTC] [XXX.XXX.XXX.XXX:53565] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.78]
[2020-11-27 09:10:48 UTC] [XXX.XXX.XXX.XXX:58610] [UDP] QNAME: calendar-pa.clients6.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.204.138]
[2020-11-27 09:12:21 UTC] [XXX.XXX.XXX.XXX:49641] [UDP] QNAME: calendar.google.com; QTYPE: A; QCLASS: IN; RCODE: NoError; ANSWER: [216.58.213.78]

I included the log entry for the cache flush and, right after it, the entries that did work.

As I had this issue before and thought the problem was the Cloudflare or Quad9 DNS (or at least I thought so at the time), I put in a forward zone for google.com pointing to Google's own DNS servers (so there is no "excuse" for not resolving their own services).

image

But as I still had the problem with this "fix", I'm now a bit out of ideas... Can you help please? 😉

PS: last week, I also had trouble resolving the IP of one of my domain controllers from a client computer (a local zone that I forward straight to my DCs' DNS). Once the cache was flushed, it worked again instantly. It is as if the cache stops working or stops "answering" clients after a while for certain entries, but this is just a feeling.

ShreyasZare commented 3 years ago

Thanks for the feedback. Can you try to find a way to reproduce this issue? The logs you provided show that the domain was resolved.

Even for your DC setup, check what the cache holds before resolving the domain when you start the day, so that you can see any expired records that exist. Then use the DNS Client tool to query the domain and check whether it resolves and whether the cache reflects the latest data.

Without a method to reproduce this issue it would be quite difficult to figure out what is going wrong.

malix0 commented 3 years ago

I don't know if my problem is related to this one, but today I upgraded to the latest Technitium version, 5.5, and also got the November 19, 2020-KB4586878 Cumulative Update Preview for .NET Framework 3.5 and 4.8 for Windows 10. Now I have problems resolving names outside my zones. I get this error:

[2020-11-29 18:18:31 UTC] DNS Server recursive resolution failed for QNAME: answers.microsoft.com; QTYPE: A; QCLASS: IN; Name Servers: 1.1.1.1:53, 1.0.0.1:53; TechnitiumLibrary.Net.Dns.DnsClientResponseValidationException: Invalid response was received: QNAME mismatch. in TechnitiumLibrary.Net.Dns.DnsClient.<>c__DisplayClass36_0.<<InternalResolveAsync>g__DoResolveAsync|1>d.MoveNext()

ShreyasZare commented 3 years ago

@malix0 thanks for the post. If you had v5.3 or older installed and upgraded to v5.5, there is an added security check called QNAME randomization (you can disable it in settings, but it's recommended to keep it enabled).

The error you are seeing says that the domain name sent to Cloudflare DNS does not match the one in the response received. It could be that you are not really querying Cloudflare DNS and that your ISP is instead hijacking DNS requests and answering them itself. Try using DNS-over-HTTPS or DNS-over-TLS and see if the error stops appearing in the logs.
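To illustrate the check described above, here is a minimal sketch in Python. It assumes that QNAME randomization means randomizing the letter case of the query name (often called 0x20 encoding) and comparing the echoed name case-sensitively. The function names are hypothetical and this is not Technitium's actual code; the lowercasing interceptor is only one conceivable way a mismatch could arise.

```python
import random


def randomize_qname_case(qname: str, rng: random.Random) -> str:
    """Randomly pick upper or lower case for every letter of the query name."""
    return "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower()
        for ch in qname
    )


def response_matches(sent_qname: str, response_qname: str) -> bool:
    """Case-sensitive comparison: the responder must echo the exact casing."""
    return sent_qname == response_qname


rng = random.Random()
sent = randomize_qname_case("answers.microsoft.com", rng)

# A well-behaved upstream copies the question section back unchanged.
honest_reply = sent
# Hypothetical interceptor that rewrites the name in lowercase.
intercepted_reply = sent.lower()

print("sent:       ", sent)
print("honest ok:  ", response_matches(sent, honest_reply))
print("hijacked ok:", response_matches(sent, intercepted_reply))
```

An off-path attacker or a middlebox that does not preserve the exact casing would fail this comparison, which is why a mismatch can hint that something other than the intended upstream answered.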

EHRETic commented 3 years ago

The issue (at least with Google Calendar) came back. Always solved by emptying the cache.

But I've noticed that the IP returned after the flush changes, which makes me wonder what kind of failover/load-balancing system they are using. How does the cache react if the IP changes?

ShreyasZare commented 3 years ago

Thanks for the details. When the issue comes back, check the Cache Zone to see what IP it has, then use the DNS Client tool to query calendar.google.com and see whether the new IP gets updated in the cache. Check what IP address the web browser shows under developer tools > network tab. If you open cmd and ping calendar.google.com, you will see what IP the OS has cached.

It could be that when the DNS server gets the query for calendar.google.com, the cached records have expired, so it tries to fetch the latest data. If that fetch takes time, the serve-stale feature is triggered: the expired record in the cache is returned for the original query, and the expired record's TTL is reset to 30 seconds. The user now has the old IP address from the stale cache, and the web browser / OS caches it for a couple of minutes more. Meanwhile, the DNS server gets the reply from the upstream server and updates the cache with the latest data, overwriting the stale data.

In that case, the DNS server has returned a stale IP address, and shortly afterwards, when it gets the response from upstream, it updates the cache. But the user's web browser and OS are now stuck with the old IP address for a couple of minutes, and that old IP address might be down for maintenance by Google.
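The sequence described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the server's real implementation: the class and parameter names are hypothetical, the 300-second TTL is invented, and the two IP addresses are merely borrowed from the logs earlier in the thread for illustration. Only the 30-second stale TTL comes from the explanation above.

```python
from dataclasses import dataclass

STALE_TTL_SECONDS = 30  # TTL given to a stale answer, as described above


@dataclass
class CacheEntry:
    ip: str
    expires_at: float  # absolute time in seconds


class ServeStaleCache:
    def __init__(self):
        self.entries = {}

    def put(self, name, ip, ttl, now):
        """Store or overwrite a record, e.g. when upstream finally replies."""
        self.entries[name] = CacheEntry(ip, now + ttl)

    def answer(self, name, now, upstream_is_slow):
        """Return (ip, ttl, is_stale) for a query, or None if nothing is served."""
        entry = self.entries.get(name)
        if entry is None:
            return None
        if now < entry.expires_at:
            return entry.ip, int(entry.expires_at - now), False
        if upstream_is_slow:
            # Expired, but the refresh has not finished: serve the old record
            # and give it a short new lifetime.
            entry.expires_at = now + STALE_TTL_SECONDS
            return entry.ip, STALE_TTL_SECONDS, True
        return None  # expired and upstream is fast: caller resolves fresh


cache = ServeStaleCache()
cache.put("calendar.google.com", "216.58.215.46", ttl=300, now=0)

# t=400: record expired and upstream is slow, so the client gets the old IP.
stale = cache.answer("calendar.google.com", now=400, upstream_is_slow=True)

# t=402: the upstream reply arrives and overwrites the stale record.
cache.put("calendar.google.com", "216.58.213.78", ttl=300, now=402)
fresh = cache.answer("calendar.google.com", now=403, upstream_is_slow=False)

print(stale)
print(fresh)
```

The client that asked at t=400 keeps the old address in its browser and OS caches even though the server already holds the new one at t=402, which matches the symptom of the server log looking healthy while the site stays unreachable.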

EHRETic commented 3 years ago

Hi,

I didn't try it, but the problem is happening quite often. I have the feeling Google is changing/rotating their servers all the time.

Is there any way NOT to cache a zone? If not, that might be an option to add. I don't know how Windows DNS handles such a case, but as far as I can remember, I never had problems before.

This makes me sad. I really like the benefits of Technitium, but if my wife has too many issues (and I can't blame her!), I'll have to roll back... 🤔

ShreyasZare commented 3 years ago

Thanks for the feedback. I think the fix for this issue will be to make the cache's serve-stale feature configurable so that it can be disabled. Serve-stale keeps expired records in the cache for up to 7 days, so this period can also be made configurable and set to a lower value so that expired records are discarded earlier.

Will try to get both of these options into the next release.
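As a rough illustration of the two proposed options, the sketch below models a purge step with an on/off switch and a configurable maximum stale period. The setting names `serve_stale` and `max_stale_seconds` and the function `purge_expired` are hypothetical and are not the names used in the actual release; only the 7-day default is taken from the comment above.

```python
from dataclasses import dataclass

SEVEN_DAYS = 7 * 24 * 60 * 60  # default stale retention mentioned above


@dataclass
class CacheSettings:
    serve_stale: bool = True
    max_stale_seconds: int = SEVEN_DAYS


def purge_expired(cache: dict, now: float, settings: CacheSettings) -> dict:
    """Return a new cache without records that may no longer be served.

    `cache` maps a name to the absolute time at which its TTL expires.
    """
    grace = settings.max_stale_seconds if settings.serve_stale else 0
    return {
        name: expires_at
        for name, expires_at in cache.items()
        if now < expires_at + grace
    }


cache = {"fresh.example": 1000, "stale.example": 100}

kept_default = purge_expired(cache, now=500, settings=CacheSettings())
kept_disabled = purge_expired(cache, now=500, settings=CacheSettings(serve_stale=False))
kept_short = purge_expired(cache, now=500, settings=CacheSettings(max_stale_seconds=60))

print(sorted(kept_default))
print(sorted(kept_disabled))
print(sorted(kept_short))
```

With serve-stale disabled, or with a short maximum stale period, the expired record is dropped instead of being kept around as a fallback answer.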

EHRETic commented 3 years ago

That's wonderful! 😉

Quax1507 commented 3 years ago

Same problem here. Can I download a compiled version 5.3 of the Windows Setup somewhere?

ShreyasZare commented 3 years ago

Thanks for the feedback. It's not recommended to downgrade, since the older version may not be able to read the config file changes that the new version has written. In that case you will need to delete the old config and let the old version create default config files, which will cause loss of settings and items like zone files or DHCP scopes.

The next version should be available in Jan so you will get this issue fixed automatically with an upgrade if you are willing to wait for a while.

Quax1507 commented 3 years ago

Sorry, but I can't wait until January :-( The config is not that much - I can do that within minutes... Could You give me a download link, please?

ShreyasZare commented 3 years ago

Sure, do you need the Windows or Linux version?

Quax1507 commented 3 years ago

I need the Windows (Setup) version

ShreyasZare commented 3 years ago

Download from here: DnsServerSetupv5.3.zip

Quax1507 commented 3 years ago

Thank You very very much!

EHRETic commented 3 years ago

Lovely, I'll do the update with my distro migration, thanks a lot! 😄

I just learned that CentOS 8 will end next year... I'll switch to Ubuntu Server and cross my fingers that they won't be bought by IBM too!!! Everybody has to move to CentOS Stream, which is, from my understanding, the future beta of RHEL... 😑

https://blog.centos.org/2020/12/future-is-centos-stream/

EHRETic commented 3 years ago

Hi there,

Don't know if Google fixed their "never-ever moving" servers, but it seems that since I disabled QNAME randomization, I don't have any issues anymore. So I don't know which one it was for now. I might try to reactivate QNAME randomization to see, and let you know. 😉

ShreyasZare commented 3 years ago

Thanks for the update. The QNAME randomization feature works only with UDP transport, and since you have DoH configured, it won't have any effect.

EHRETic commented 3 years ago

Hi again,

Some updates: with Google servers, I don't have any issues anymore, so it seems it was related to them, combined with some internal problems, because what still remains are my local domain issues. From time to time, I can't access a local server anymore because the record is not valid anymore, despite the server always being online. When this server is the Domain Controller... big trouble! 😁

QNAME randomization is deactivated in the settings (even though it should not be used anyway, I left it off).

Let's wait for the next version to test it further! 😉

ShreyasZare commented 3 years ago

Are you using a conditional forwarder for the local domain? I ask because an issue was fixed in cache prefetching that would overwrite the cache under certain conditions.

Anyway, the next release will be available by this weekend, so let's see if that fixes the issue.

ShreyasZare commented 3 years ago

New update is available which should fix this issue. Do reopen this issue if it occurs again.

EHRETic commented 3 years ago

Yes, I am using a conditional forwarder for my local domain. I've now updated all my servers to 5.6; let's see if this comes back. Thanks a lot for your support!

PS: I can feel the performance increase, even in my small environment... impressive & great work! 😉