NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

[FR] Managing Cache Deletion and Fallback to Forwarding During Unbound Recursive DNS Failures #1061

Open kkkgo opened 2 months ago

kkkgo commented 2 months ago

I have been using Unbound for about four to five years, greatly benefiting from its robust caching capabilities, which generally result in very fast DNS responses. I employ the following primary configuration to enable Unbound's optimistic caching:

    serve-expired: yes
    serve-expired-ttl: 0
    serve-expired-reply-ttl: 0
    prefetch: yes

My expected operational outcomes are:

This configuration works well under most circumstances, and I have also shared it as a Docker image with others. However, due to the unstable network quality of some ISPs, particularly for authoritative DNS servers located abroad, there are consistent connectivity issues, leading to:

  1. Some domains consistently fail recursive queries, yielding no results.
  2. Some domains occasionally succeed in recursive queries but fail most of the time, leading to stale cache data.

For the first issue, I have implemented a simple plugin that uses a third-party DNS as a downstream fallback for Unbound. The server first attempts to resolve via Unbound, and if it fails to get a response within a specific time frame (e.g., 200 ms), it forwards the request to a public DNS. This is also applicable for fault tolerance when the Unbound service is interrupted.
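
A timeout-based fallback of this kind could look roughly like the sketch below. It is an illustration only, not the actual plugin: the function name `resolve_with_fallback`, the stub resolvers, and the example addresses are all hypothetical, and real resolver calls would replace the stubs.

```python
import concurrent.futures as cf
import time


def resolve_with_fallback(primary, fallback, name, timeout_s=0.2):
    """Ask `primary` first; use `fallback` if it errors or exceeds timeout_s."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, name)
    try:
        return future.result(timeout=timeout_s), "primary"
    except Exception:
        # Covers a timeout as well as any error raised by the primary resolver.
        return fallback(name), "fallback"
    finally:
        # Do not block on a slow primary; its thread finishes in the background.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for "query Unbound" and "query a public DNS".
def fast_primary(name):
    return "192.0.2.1"


def slow_primary(name):
    time.sleep(0.5)  # simulates a recursion that is still in progress
    return "192.0.2.1"


def public_dns(name):
    return "198.51.100.7"
```

Calling `resolve_with_fallback(slow_primary, public_dns, "example.com")` returns the fallback answer, because the primary does not reply within 200 ms. Note that this only detects a missing or late reply; it cannot tell a fresh answer from a stale one, which is the limitation described next.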

For the second issue, I found it challenging to make decisions downstream because once a domain's recursive query succeeds, its result is continuously cached, even if set with a long serve-expired-ttl. This means the cache keeps serving the data with a TTL of 0. Hence, if I use TTL=0 as a criterion for fallback DNS, it negates the benefit of serve-expired: yes. The crux of the problem is that I cannot use downstream DNS to determine if Unbound has successfully completed a recursive query.

To address this, I propose two possible solutions:

  1. Introduce a threshold for deleting cache entries after consecutive DNS refresh failures, such as serve-expired-fetch-fail: 5. If a DNS refresh query fails more than five times, the expired result should be removed from the cache. This would allow downstream DNS to recognize Unbound's unavailability and switch to a public DNS result, a process feasible for most DNS servers with parallel query capabilities.
  2. On a recursive query failure, attempt to fall back to a public DNS. This could be facilitated by adding an option like recursive-first: yes:
    server:
        forward-zone:
            name: "."
            recursive-first: yes
            forward-addr: 8.8.8.8

This approach ensures that if recursive querying fails, a request is made to a public DNS, thus refreshing the DNS cache.
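
The consecutive-failure threshold from the first proposal can be modelled as a small counter per cache entry. The sketch below is a toy model of the idea, not Unbound code: `serve-expired-fetch-fail` is the option proposed above and does not exist in Unbound, and the `StaleCache` class and its method names are hypothetical.

```python
class StaleCache:
    """Toy cache that drops an entry after too many consecutive failed refreshes."""

    def __init__(self, max_fetch_failures=5):
        # Stands in for the proposed (hypothetical) serve-expired-fetch-fail value.
        self.max_fetch_failures = max_fetch_failures
        self.entries = {}   # name -> cached answer
        self.failures = {}  # name -> consecutive failed refreshes

    def store(self, name, answer):
        """Record a successful resolution; this resets the failure counter."""
        self.entries[name] = answer
        self.failures[name] = 0

    def refresh_failed(self, name):
        """Record one failed refresh; evict once the threshold is exceeded."""
        if name not in self.entries:
            return
        self.failures[name] += 1
        if self.failures[name] > self.max_fetch_failures:
            del self.entries[name]
            del self.failures[name]

    def lookup(self, name):
        """Return the cached answer, or None if there is no entry."""
        return self.entries.get(name)
```

With the threshold at 5, the entry survives five failed refreshes and is removed on the sixth, at which point a downstream server sees no answer and can switch to its public DNS result.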

As I am not a professional programmer, these are just some of my thoughts and suggestions. I am open to hearing if there are more viable solutions or improvements to my approach.

Dynamic5912 commented 1 month ago

Not an answer - however what is the benefit of setting serve-expired-reply-ttl to zero?

I'm not using Redis or anything - just straight Unbound.

Thanks!

kkkgo commented 1 month ago

The serve-expired-reply-ttl setting defines the TTL value used for expired cache responses sent to the client. Setting it to 0 means Unbound will send a response with a TTL of 0 to the client, indicating that the record has expired and should be re-queried immediately. This allows expired DNS records to be refreshed as quickly as possible, reducing the retention time of expired records. If the client receives an expired record that is unusable, it can promptly initiate another DNS resolution attempt.

Dynamic5912 commented 1 month ago

Excellent - thanks for explaining.

Is there any benefit to also using serve-expired-ttl-reset?

I never really could get my head around understanding exactly what this does - and I note that by default it's set to no by Unbound.

gthess commented 1 month ago

It does what it says in the man page: https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-serve-expired-ttl-reset :)

When serve-expired-ttl is used, it limits the time during which an expired record can be used. After that time the expired record, although still in the cache, is not used. Normally, when an expired record is used, a query for that record is also generated to try to fetch fresh data. If upstream is down, those queries will fail to update the record and serve-expired-ttl will eventually be reached, rendering the record unusable.

By enabling serve-expired-ttl-reset, failed attempts to fetch fresh data will reset the serve-expired-ttl window (to current time + serve-expired-ttl), so that the expired record stays usable while there are queries for it, until it is eventually updated by upstream or clients lose interest in the record (and stop triggering further resets).

Dynamic5912 commented 1 month ago

Thanks.

I'm still not entirely clear how it works, but ok :)

gthess commented 1 month ago

Ok @Dynamic5912, another attempt :) Let's assume the starting time is 0 and the following configuration is in effect:

server:
    serve-expired: yes
    serve-expired-ttl: 10
    serve-expired-ttl-reset: no

When a record with a TTL of 5 is cached for the first time, given the current time of 0, we will represent it with the below absolute values:

original-ttl = 5
expired-ttl = 15 (original ttl + serve-expired-ttl)

At time 6 the record is expired but it can still be used because expired-ttl allows serving stale data until time 15. At the same time, a query for an expired record also triggers an upstream query to try to refresh the data. If the upstream query fails to do so (because of SERVFAIL or other failures), the expired record remains the same.

So even though there is demand for the record, because of the upstream failures that we assume happen for the duration of this example, at time 16 the record can no longer be used. This is because the original-ttl (5) dictates that this record is expired and the expired-ttl (15) dictates that this expired record can no longer be used.

If we had used serve-expired-ttl-reset: yes instead, each query for the expired record would reset (update) the expired-ttl value. For example at time 6, we would:

  1. get the client query,
  2. reply with the expired record,
  3. try to resolve,
  4. hit a failure (like SERVFAIL),
  5. update the expired-ttl to: now + serve-expired-ttl.

So we would have the following values in the cache:

original-ttl = 5
expired-ttl = 16 (now + serve-expired-ttl)

In this way whenever an expired record is used, Unbound will eventually either replace the record with fresh data, or update the expired-ttl value so that the expired record can still be used while there is demand for it.
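
The timeline above can be replayed with a small simulation. This is a sketch of the simplified model used in this explanation, not of Unbound's internals: the function name `usable_until` is made up, and it assumes the record is cached at time 0, every upstream refresh fails, the record counts as expired once `now > original_ttl`, and it stays servable while `now <= expired_ttl`.

```python
def usable_until(original_ttl, serve_expired_ttl, ttl_reset, query_times):
    """Return the last time at which the expired record may still be served.

    Models the example: record cached at time 0, all upstream refreshes fail,
    clients query at the times listed in `query_times`.
    """
    expired_ttl = original_ttl + serve_expired_ttl
    for now in sorted(query_times):
        is_expired = now > original_ttl
        still_servable = now <= expired_ttl
        if is_expired and still_servable and ttl_reset:
            # Failed refresh: move the window to now + serve-expired-ttl.
            expired_ttl = now + serve_expired_ttl
    return expired_ttl


# serve-expired-ttl-reset: no  -> the window stays at 15
print(usable_until(5, 10, False, [6]))
# serve-expired-ttl-reset: yes -> a query at time 6 moves the window to 16
print(usable_until(5, 10, True, [6]))
```

With the reset enabled, repeated queries keep pushing the window forward, for example queries at times 6, 14 and 20 leave the record usable until time 30, while a gap longer than serve-expired-ttl between queries lets the window close.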


In summary:

If you are not using the serve-expired-ttl* options, Unbound will keep serving expired records as long as they are in the cache.

If you use serve-expired-ttl you restrict the time window in which an expired record can be used (i.e., don't serve expired records once more than serve-expired-ttl seconds have passed since they expired).

If you enable serve-expired-ttl-reset you prolong that time window whenever the expired record is used and the resulting refresh attempt fails to produce a fresh record.

Dynamic5912 commented 1 month ago

Thanks. That makes more sense.

Also, to confirm: does setting serve-expired-ttl to 0 mean the records are retained in the cache and served indefinitely?

gthess commented 1 month ago

As long as they are in the cache. This is also the default value currently.

Dynamic5912 commented 1 month ago

OK.

So how would this setup work - would this indefinitely serve expired cached items, whilst refreshing them in the background to keep them up to date?

    cache-min-ttl: 3600
    cache-max-ttl: 86400

    serve-expired: yes
    serve-expired-ttl: 0
    serve-expired-ttl-reset: yes
    serve-expired-reply-ttl: 0

    prefetch: yes

gthess commented 1 month ago

cache-min-ttl: 3600

Each record will be cached for at least 3600 seconds.

cache-max-ttl: 86400

Each record will be cached for at most 86400 seconds.

serve-expired: yes

Expired records will be considered.

serve-expired-ttl: 0

Expired records will be considered regardless of how old they are.

serve-expired-ttl-reset: yes

Failure to update an expired record will reset its serve-expired-ttl window. (This has no effect unless serve-expired-ttl is set to a non-zero value.)

serve-expired-reply-ttl: 0

Expired answers will have a TTL of 0. (RFC 8767 discusses this in https://www.rfc-editor.org/rfc/rfc8767#section-6-6 and recommends 30 seconds, which is the default in Unbound.)

prefetch: yes

When Unbound uses a record as an answer and that record is in the last 10% of its TTL, Unbound will also send a query upstream to try to fetch a newer record to replace the cached one.
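
As a rough illustration of how the TTL limits and the prefetch window interact, here is a small sketch. It is a simplified model rather than Unbound's code: the helper names `cached_ttl` and `in_prefetch_window` are hypothetical, and it assumes the 10% window is measured against the TTL as stored after clamping.

```python
def cached_ttl(upstream_ttl, cache_min_ttl=3600, cache_max_ttl=86400):
    """Clamp an upstream TTL into the [cache-min-ttl, cache-max-ttl] range."""
    return max(cache_min_ttl, min(upstream_ttl, cache_max_ttl))


def in_prefetch_window(stored_ttl, remaining_ttl):
    """True when the record is in the last 10% of its TTL but not yet expired."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 0 < remaining_ttl and remaining_ttl * 10 <= stored_ttl
```

Under these assumptions a record that arrives with a TTL of 60 is stored for 3600 seconds, and a client query during its last 360 seconds would trigger a prefetch.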

Dynamic5912 commented 1 month ago

I really appreciate your help/input - and apologies to OP for hijacking their issue/post!

So it looks like my thinking is correct, and serve-expired-ttl-reset is not really needed since I am specifying 0 as serve-expired-ttl rather than a non-zero time (e.g. 3600 or 86400).