Don't return cached error if there are cached entry available

TechnitiumSoftware / DnsServer

Technitium DNS Server

https://technitium.com/dns/

GNU General Public License v3.0


Don't return cached error if there are cached entry available #965

Closed raphielscape closed 1 month ago

raphielscape commented 4 months ago

Technitium's current caching mechanism reduces the frequency of retry attempts by serving cached errors as responses. However, this means cached errors are also served during transient network failures that make the forwarders unreachable: the error FailureCache: ServerFailure; NoReachableAuthority gets cached if one or more clients query the same name more than once at the same time.

This issue persists even when valid cached entries are available, as the server continues to serve the error without attempting to retrieve new entries. This behavior is observed despite the server having a low negative cache TTL configured, which should ideally prevent the cached negative entry from being served during brief network outages. Moreover, the presence of a cached negative result appears to inhibit the use of Serve Stale entries, as the server shows a preference for serving cached negative entries over cached positive ones.

ShreyasZare commented 4 months ago

Thanks for the feedback. Yes, the implementation is meant to avoid sending frequent requests to the upstream server in case of network issues. The negative cache TTL will cause the clients to be served the negative response and prevent the DNS server from trying to resolve the requests until the negative cache expires. Once the cache has expired, the DNS server will perform the resolution and cache the updated result, overwriting the negative cache. If there is already a "stale" (i.e. expired) positive answer in the cache, it will always be used and it won't get overwritten by the negative cache.

The issue you see could be caused by some combination of events. It would be great if you could provide the details shown in the Cache section of the panel for the specific domain name when the issue occurs. That would help in understanding the issue based on the cache status. If you can find steps to reproduce this issue, that would also help.

Also, do you have the EDNS Client Subnet option enabled, or are you using the Advanced Forwarding app?
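To make the described precedence concrete, here is a minimal toy model in Python. It is only a sketch of the behavior as explained above, not Technitium's actual code; the names `ToyCache`, `Entry`, and the status strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    data: str
    expires_at: float          # absolute expiry time, in seconds
    is_failure: bool = False   # True for a cached resolution failure


class ToyCache:
    """Hypothetical model: positive and failure entries are kept separately."""

    def __init__(self) -> None:
        self._positive: Dict[str, Entry] = {}
        self._failure: Dict[str, Entry] = {}

    def store(self, key: str, entry: Entry) -> None:
        if entry.is_failure:
            # A failure never overwrites an existing positive answer.
            self._failure[key] = entry
        else:
            # A successful resolution replaces any cached failure.
            self._positive[key] = entry
            self._failure.pop(key, None)

    def lookup(self, key: str, now: float) -> Tuple[Optional[Entry], str]:
        positive = self._positive.get(key)
        if positive is not None and now < positive.expires_at:
            return positive, "fresh"
        failure = self._failure.get(key)
        if failure is not None and now < failure.expires_at:
            # Upstream failed recently, so do not retry yet.
            # Prefer an expired positive answer over the cached error.
            if positive is not None:
                return positive, "stale"
            return failure, "cached-failure"
        # Nothing usable is cached: the caller should resolve upstream.
        return None, "resolve"
```

In this model a cached failure is only returned to the client when no positive entry, fresh or expired, exists for the same key.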

raphielscape commented 4 months ago

Yes, I do have the EDNS Client Subnet option enabled, with an override value set. I also set Serve Stale Max Wait Time to 30 ms and Cache Minimum TTL to 0 s. The issue can be reproduced by intentionally or unintentionally causing a transient network failure (e.g. disconnecting the network, or artificially raising the latency with the tc network emulator until timeouts occur).

Here are the details from the Cache section when it happened:

[
  {
    "name": "whoami.ds.akahelp.net",
    "type": "TXT",
    "ttl": "0 (0 sec)",
    "rData": {
      "dataType": "DnsSpecialCacheRecordData",
      "data": "FailureCache: ServerFailure; NoReachableAuthority: No response from name servers for whoami.ds.akahelp.net. TXT IN"
    },
    "dnssecStatus": "Unknown",
    "eDnsClientSubnet": "XXXX:XXXX:XXXX::",
    "lastUsedOn": "2024-07-10T08:30:26.2770418Z"
  },
  {
    "name": "whoami.ds.akahelp.net",
    "type": "TXT",
    "ttl": "0 (0 sec)",
    "rData": {
      "text": "ipXXX.XXX.XXX.XXX",
      "splitText": true,
      "characterStrings": [
        "ip",
        "XXX.XXX.XXX.XXX"
      ]
    },
    "dnssecStatus": "Disabled",
    "eDnsClientSubnet": "0.0.0.0/0",
    "responseMetadata": {
      "nameServer": "dns.google:853 (8.8.4.4)",
      "protocol": "Tls",
      "datagramSize": "143 bytes",
      "roundTripTime": "74.39 ms"
    },
    "lastUsedOn": "2024-07-10T08:28:27.9442334Z"
  },
  {
    "name": "whoami.ds.akahelp.net",
    "type": "TXT",
    "ttl": "0 (0 sec)",
    "rData": {
      "text": "ecsXXX.XXX.XXX.XXX/24/24",
      "splitText": true,
      "characterStrings": [
        "ecs",
        "XXX.XXX.XXX.XXX/24/24"
      ]
    },
    "dnssecStatus": "Disabled",
    "eDnsClientSubnet": "0.0.0.0/0",
    "responseMetadata": {
      "nameServer": "dns.google:853 (8.8.4.4)",
      "protocol": "Tls",
      "datagramSize": "143 bytes",
      "roundTripTime": "74.39 ms"
    },
    "lastUsedOn": "2024-07-10T08:28:27.9442334Z"
  },
  {
    "name": "whoami.ds.akahelp.net",
    "type": "TXT",
    "ttl": "0 (0 sec)",
    "rData": {
      "text": "nsXXX.XXX.XXX.XXX",
      "splitText": true,
      "characterStrings": [
        "ns",
        "XXX.XXX.XXX.XXX"
      ]
    },
    "dnssecStatus": "Disabled",
    "eDnsClientSubnet": "0.0.0.0/0",
    "responseMetadata": {
      "nameServer": "dns.google:853 (8.8.4.4)",
      "protocol": "Tls",
      "datagramSize": "143 bytes",
      "roundTripTime": "74.39 ms"
    },
    "lastUsedOn": "2024-07-10T08:28:27.9442334Z"
  }
]

dig output:

; <<>> DiG 9.18.24-1-Debian <<>> @localhost txt whoami.ds.akahelp.net
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23753
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority): (No response from name servers for whoami.ds.akahelp.net. TXT IN)
; EDE: 13 (Cached Error)
; EDE: 3 (Stale Answer)
;; QUESTION SECTION:
;whoami.ds.akahelp.net.         IN      TXT

;; Query time: 32 msec
;; SERVER: ::1#53(localhost) (UDP)
;; WHEN: Wed Jul 10 08:34:16 UTC 2024
;; MSG SIZE  rcvd: 131

Note that there is a positive cached entry, but the client received EDE 22 instead of the cached TXT record.

ShreyasZare commented 4 months ago

Thanks for the details. When you have ECS enabled, the client's subnet is used to pick the records from the cache. So, even if there is data in the cache, it's for a subnet that does not match the client.

In this case, the dig request is being sent to ::1, which is the IPv6 loopback address, whereas the data in the cache is for 0.0.0.0/0, which is an IPv4 network. This is the reason it's not being used. If you query the 127.0.0.1 IPv4 loopback address instead, it would return the cached data that you are expecting.
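The address-family mismatch can be illustrated with Python's standard `ipaddress` module. This is only a sketch of subnet-scoped matching in general, assuming a simple "client address is inside the cached scope" rule; the helper name `scope_matches` is hypothetical and the server's real selection logic may differ.

```python
import ipaddress


def scope_matches(client_ip: str, cached_scope: str) -> bool:
    """Return True if client_ip falls inside the cached ECS scope."""
    client = ipaddress.ip_address(client_ip)
    scope = ipaddress.ip_network(cached_scope, strict=False)
    # Membership is always False when one side is IPv4 and the other IPv6.
    return client in scope


print(scope_matches("::1", "0.0.0.0/0"))        # False
print(scope_matches("127.0.0.1", "0.0.0.0/0"))  # True
```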

I would also recommend that you reset the cache settings to their default values, since those are optimal for cache performance. The Cache Failure TTL value is set to 10 sec by default, and this is the value used when resolution fails. The Cache Negative TTL is used for negative responses, i.e. when the upstream does respond, as in the NXDOMAIN or NODATA cases.
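The distinction between the two settings can be sketched as a small TTL selection function. This is a simplified illustration of the explanation above, not the server's code: `Outcome` and `cache_ttl` are hypothetical names, the 10 second failure default comes from the comment above, and the negative TTL is left as a required parameter because its default is not stated here.

```python
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"      # upstream returned records
    NXDOMAIN = "nxdomain"  # upstream responded: name does not exist
    NODATA = "nodata"      # upstream responded: no records of this type
    FAILURE = "failure"    # no usable response from upstream at all


def cache_ttl(outcome: Outcome, record_ttl: int, negative_ttl: int,
              failure_ttl: int = 10) -> int:
    """Pick the TTL (in seconds) to cache a resolution result with."""
    if outcome is Outcome.FAILURE:
        # Resolution failed (e.g. network outage): Cache Failure TTL.
        return failure_ttl
    if outcome in (Outcome.NXDOMAIN, Outcome.NODATA):
        # Upstream answered negatively: Cache Negative TTL.
        return negative_ttl
    # Normal answer: the record's own TTL applies.
    return record_ttl
```

Under this model, lowering the negative cache TTL has no effect on how long a ServerFailure entry is kept, which matches the behavior reported earlier in the thread.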

raphielscape commented 4 months ago

I will try resetting the cache setting and see how it behaves. However, it may be beneficial to have a knob to disable error caching, particularly when low TTL caching is necessary. Many DNS services announce TTLs lower than 10 seconds for failover purposes, even when it may not be strictly necessary, and this shows in the number of recursive queries on the server. Google, for instance, frequently sets a TTL of 5 seconds on DNS responses for YouTube domains even though their load balancer generally rotates only once every 10-20 seconds, and some Akamai responses set a TTL of 0 s, which is why I set it to 0 s.

ShreyasZare commented 4 months ago

I will try resetting the cache setting and see how it behaves. However, it may be beneficial to have a knob to disable error caching, particularly when low TTL caching is necessary.

Failure caching is actually a beneficial feature, and removing it would not help, since the underlying issue causing the failure still remains for at least a few seconds in most cases. Without this feature, the DNS server would drain a lot of server resources through frequent recursive resolution attempts. Note that there could be thousands of concurrent requests, making it worse when the network drops for a few seconds.

The current failure TTL is 10 seconds, which you can reduce further if needed. Still, 10 sec is a decent value, and if your DNS server has been running for some time, a lot of the data in its cache would be used by the Serve Stale feature. The failure case only arises when there is a network issue and no data is available in the cache for that specific request. Also, enabling the EDNS feature splits the cache to store data per client subnet, which has an additional effect in such cases.

Many DNS services announce TTLs lower than 10 seconds for failover purposes, even when it may not be strictly necessary, and this shows in the number of recursive queries on the server. Google, for instance, frequently sets a TTL of 5 seconds on DNS responses for YouTube domains even though their load balancer generally rotates only once every 10-20 seconds, and some Akamai responses set a TTL of 0 s, which is why I set it to 0 s.

You are confusing a record's TTL value with the failure TTL. Even if a record's TTL is very low, or even 0, it does not matter if a client reuses the record for a few seconds: the server at that IP address is not going to stop serving. Those service providers are just trying to make sure that DNS-level load balancing works.
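As a rough illustration of the resource argument, the following sketch counts how many upstream resolution attempts a single name would trigger during an outage, with and without a failure TTL. It assumes every attempt fails and that requests are handled one after another; `count_upstream_attempts` is a hypothetical helper, not part of the server.

```python
from typing import Iterable


def count_upstream_attempts(query_times: Iterable[float],
                            failure_ttl: float) -> int:
    """Count upstream attempts for one name while every attempt fails."""
    attempts = 0
    blocked_until = float("-inf")  # no cached failure yet
    for t in sorted(query_times):
        if t >= blocked_until:
            # No valid cached failure: try upstream (and fail again).
            attempts += 1
            blocked_until = t + failure_ttl
        # Otherwise the cached failure is served and upstream is skipped.
    return attempts


# 1000 queries for the same name spread over 5 seconds of outage.
times = [i * 0.005 for i in range(1000)]
print(count_upstream_attempts(times, failure_ttl=10))  # 1
print(count_upstream_attempts(times, failure_ttl=0))   # 1000
```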

raphielscape commented 4 months ago

So this seems like unintended behavior then, because the server has been running for some time and there are positive cache entries already, yet the server returns EDE 22.

raphielscape commented 4 months ago

Here is the issue, reproduced again with the default configuration.

Cache:

[
  {
    "name": "cloudflare.com",
    "type": "A",
    "ttl": "6 (6 sec)",
    "rData": {
      "dataType": "DnsSpecialCacheRecordData",
      "data": "FailureCache: ServerFailure; NoReachableAuthority: No response from name servers for cloudflare.com. A IN"
    },
    "dnssecStatus": "Unknown",
    "eDnsClientSubnet": "XXX.XXX.XXX.XXX", <- Overridden eDNS Subnet configured in Server
    "lastUsedOn": "2024-07-10T12:19:23.4306024Z"
  },
  {
    "name": "cloudflare.com",
    "type": "A",
    "ttl": "0 (0 sec)",
    "rData": {
      "ipAddress": "104.16.133.229"
    },
    "dnssecStatus": "Disabled",
    "eDnsClientSubnet": "XXX.XXX.XXX.XXX/24", <- Overridden eDNS Subnet configured in Server
    "responseMetadata": {
      "nameServer": "dns.google:853 (8.8.8.8)",
      "protocol": "Tls",
      "datagramSize": "87 bytes",
      "roundTripTime": "35.63 ms"
    },
    "lastUsedOn": "2024-07-10T12:14:13.5120277Z"
  },
  {
    "name": "cloudflare.com",
    "type": "A",
    "ttl": "0 (0 sec)",
    "rData": {
      "ipAddress": "104.16.132.229"
    },
    "dnssecStatus": "Disabled",
    "eDnsClientSubnet": "XXX.XXX.XXX.XXX/24", <- Overridden eDNS Subnet configured in Server
    "responseMetadata": {
      "nameServer": "dns.google:853 (8.8.8.8)",
      "protocol": "Tls",
      "datagramSize": "87 bytes",
      "roundTripTime": "35.63 ms"
    },
    "lastUsedOn": "2024-07-10T12:14:13.5120277Z"
  }
]

dig result:

; <<>> DiG 9.18.24-1-Debian <<>> @127.0.0.1 cloudflare.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23688
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority): (No response from name servers for cloudflare.com. A IN)
; EDE: 13 (Cached Error)
; EDE: 3 (Stale Answer)
;; QUESTION SECTION:
;cloudflare.com.                        IN      A

;; Query time: 30 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Wed Jul 10 12:19:17 UTC 2024
;; MSG SIZE  rcvd: 115

raphielscape commented 4 months ago

The issue is that there is a positive record cached, but when a transient network failure prevents the server from reaching the resolvers, it returns the cached error rather than the cached positive record.

ShreyasZare commented 4 months ago

Here is the issue, reproduced again with the default configuration.

I tried to reproduce it, but it's working here and I get a stale answer. Since the cache eDnsClientSubnet is redacted, it's not clear how it will interact with queries. If possible, share the unredacted cache data and also screenshots of your EDNS config and Cache config. Send those details to support@technitium.com so that I can try the exact same config and try to reproduce the issue.

raphielscape commented 4 months ago

I have sent the requested information, with the reproduction steps, by email.

ShreyasZare commented 4 months ago

Thanks for the details. I was able to reproduce the issue. The reason for this is that the ECS IPv4 Override option is set to an IP address instead of a network address. So, it's actually being set as x.x.x.x/32, which prevents the correct entry from being selected from the cache. A related issue, reported over email by another user a couple of weeks ago, is also playing a part in your case.

To fix this issue, you just need to set the ECS IPv4 Override option to x.x.x.x/24 and it will start working as expected.

I am adding validation code which will fix such cases automatically and prevent this issue from occurring. The other related issue I mentioned is also fixed in the development code. Both fixes will be available with the next update.
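For readers who want to check their own override value, here is a minimal sketch of this kind of normalization using Python's `ipaddress` module. It is not the validation that ships with the server; `normalize_ecs_override` is a hypothetical helper, the /24 default is taken from the advice above, and 203.0.113.57 is a documentation-range address used purely as an example.

```python
import ipaddress


def normalize_ecs_override(value: str, default_prefix: int = 24) -> str:
    """Turn an IPv4 address or CIDR string into a proper network address."""
    if "/" not in value:
        # A bare address would otherwise be treated as a /32 host route.
        value = f"{value}/{default_prefix}"
    # strict=False masks off the host bits instead of raising ValueError.
    network = ipaddress.IPv4Network(value, strict=False)
    return str(network)


print(ipaddress.IPv4Network("203.0.113.57"))      # 203.0.113.57/32
print(normalize_ecs_override("203.0.113.57"))     # 203.0.113.0/24
print(normalize_ecs_override("203.0.113.57/24"))  # 203.0.113.0/24
```

The first line shows the problem described above: without an explicit prefix, the value is interpreted as a single host (/32) rather than as a network.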

ShreyasZare commented 1 month ago

Technitium DNS Server v13 is now available and adds validation for ECS options. Do update and let me know your feedback.