SERVFAIL problems with unbound 1.15.0 running on OpenBSD 7.1

renaudallard commented 2 years ago

Describe the bug Since I upgraded my DNS servers to OpenBSD 7.1 with unbound 1.15.0, I have a lot of issues with DNS resolution (without changing anything in the config). I randomly get SERVFAIL for a lot of names, or something even stranger like some addresses and SERVFAIL for others (see dashlane example). The exact same config was working just fine with unbound 1.13.0 on OpenBSD 7.0. Note that it does not happen every time, it's quite sporadic.

Examples: host dashlane.com dashlane.com has address 65.9.82.43 dashlane.com has address 65.9.82.13 dashlane.com has address 65.9.82.36 dashlane.com has address 65.9.82.97 Host dashlane.com not found: 2(SERVFAIL) Host dashlane.com not found: 2(SERVFAIL)

host forum.opnsense.org Host forum.opnsense.org not found: 2(SERVFAIL)

To reproduce Here is my partial config server: log-replies: yes interface: 0.0.0.0@853 tls-port: 853 tls-service-pem: tls-service-key: outgoing-range: 8192 outgoing-num-tcp: 256 incoming-num-tcp: 256 serve-expired: yes outbound-msg-retry: 5 cache-max-negative-ttl: 1 msg-cache-size: 64m msg-cache-slabs: 4 num-queries-per-thread: 32 rrset-cache-size: 128m rrset-cache-slabs: 4 infra-cache-slabs: 4 key-cache-slabs: 4 access-control: 0.0.0.0/0 allow access-control: ::0/0 allow hide-identity: yes hide-version: yes harden-short-bufsize: yes harden-large-queries: yes harden-glue: yes harden-dnssec-stripped: yes harden-below-nxdomain: yes harden-referral-path: yes use-caps-for-id: yes qname-minimisation: yes aggressive-nsec: yes edns-tcp-keepalive: yes so-reuseport: no deny-any: yes prefetch: yes prefetch-key: yes rrset-roundrobin: yes minimal-responses: yes

Expected behavior Names should resolve properly

System:

Unbound version: 1.15.0
OS: OpenBSD 7.1
unbound -V output: Version 1.15.0 Configure line: --enable-allsymbols --with-ssl=/usr --with-libevent=/usr --with-libexpat=/usr --without-pythonmodule --with-chroot-dir=/var/unbound --with-pidfile= --with-rootkey-file=/var/unbound/db/root.key --with-conf-file=/var/unbound/etc/unbound.conf --with-username=_unbound --disable-shared --disable-explicit-port-randomisation --without-pthreads Linked libs: pluggable-libevent 1.4.15-stable (it uses kqueue), LibreSSL 3.5.2 Linked modules: dns64 respip validator iterator

Additional information I tried removing every "hardening" parameter one by one to get nearer to the stock config. And it seems that disabling use-caps-for-id solves the issue.

wcawijngaards commented 2 years ago

For this query name I cannot reproduce issues with 0x20, the logs say the 0x20 replies are good. Looking at your config, I can make a guess, but it is pretty vague. Your num-queries-per-thread: 32 setting is very low. That should be adjusted to like 1024, or use the default. The setting could cause servfail, when the num queries are full, and another query comes in, the not fitting query can be rejected with SERVFAIL. How that could happen, you have prefetch, you query for the name, this gives you an answer, and starts the prefetch. Somehow with 0x20 enabled this takes longer than without 0x20, because some names have 0x20 fallback and then unbound has to query a number of hosts, like 5 queries or so, and that is more work. This makes the entry exist for a little longer. Then your second query comes in, but the list of queries is full now, with the previous one. And the new query is rejected. That would only happen when the list is full, so there is a bunch of lookups. And perhaps also in combination with prefetch and a name for which 0x20 backoff has to be performed.

If this happens, that should be shown in the logs. Maybe at higher verbosity settings. That makes the logs show what happens and why servfail is happening. You could try to fix the num queries per thread setting, or capture the event with verbosity in logs, or if logs show output now. There is also log-servfail: yes that prints summary information about servfail replies to log, that you could enable to get more information about these servfails.

renaudallard commented 2 years ago

I have been able to get the error, even without the num-queries-per-thread limit. I have enabled the log-servfail, and I get this: Apr 28 10:14:08 elendil unbound: [14984:0] error: SERVFAIL <f.root-servers.net. AAAA IN>: could not fetch nameservers for 0x20 fallback Apr 28 10:14:08 elendil unbound: [14984:0] info: 127.0.0.1 a.root-servers.net. A IN NOERROR 0.000000 1 52 Apr 28 10:14:09 elendil unbound: [14984:0] error: SERVFAIL : could not fetch nameservers for 0x20 fallback Apr 28 10:14:09 elendil unbound: [14984:0] info: 127.0.0.1 www.hubo.be. A IN SERVFAIL 0.185887 0 29

wcawijngaards commented 2 years ago

The error you have can happen if. Out of memory when making the structure. The added query makes a dependency cycle, eg. this lookup is a recursion for that nameserver lookup already. Or if the added query exceeds boundaries on number of allowed sub-requests, eg. no more than X allowed queries per incoming query, those types of limits. The failure that it is specifically is also logged, perhaps at a higher verbosity.

wcawijngaards commented 2 years ago

It could be the combination of 0x20 fallback and the harden-referral-path: yes option you have. Both of those create new subrequests in the state machine, and those increase those counters that fail to add more subqueries once the limit is reached, and then the 0x20 stops because it cannot perform the fallback any more.

wcawijngaards commented 2 years ago

Also prefetch-key creates additional sub requests. Perhaps the max-negative-ttl of 1 second is also causing refetches of like nameserver information, in particular non existing ipv6 AAAA information, that causes extra subqueries to be made.

renaudallard commented 2 years ago

The max-negative-ttl=1 was added afterwards to check if this would solve the issue, but it occured independently of that parameter. Removing harden-referral-path also didn't change the behaviour. I am trying to replicate with verbosity=4, but, as this is sporadic, it's not easy.

wcawijngaards commented 2 years ago

Another possibility is that it is out of memory. I would expect log entries from that. But I have seen OpenBSD set a heap size ulimit of 256m for processes, and then unbound, with this configuration could run out of memory. It has 128m and 64m cache sizes, and those are the sizes calculated by the data stored, not the overhead of the malloc allocator, and other stuff in memory, that could slowly fill up to the max and then out of memory could happen.

renaudallard commented 2 years ago

Here is a log of a failure with verbosity=4 unbound.log

This is taken from a freshly fully restarted unbound not being busy.

wcawijngaards commented 2 years ago

It looks like it fails because the query depth exceeds the target fetch policy max depth. Have you set the target-fetch-policy to something non default? That parameter needs to have a number of integers in it, otherwise this fetch would exceed the target fetch depth and fail. And not print anything in logs for it. I see no other log printout to indicate what caused it.

Otherwise I see it try a lot of ipv6 addresses, and also during 0x20 fallback. do-ip6: no could limit that.

renaudallard commented 2 years ago

target-fetch-policy is not defined, so it's using the default. The machine has indeed no ipv6, so putting do-ip6: no would indeed be a good idea.

wcawijngaards commented 2 years ago

Is your syslog omitting printout from the logs because of log volume? I also miss printout of lines like target fetch policy for level at startup. Perhaps a logfile: <file> directive would not omit the log lines.

renaudallard commented 2 years ago

Log volume on this machine is very low, so I would not expect syslog to drop the lines, but I can retry with a logfile: <file>

renaudallard commented 2 years ago

Interesting, now that I did put do-ip6: no, I am unable to reproduce the issue.

renaudallard commented 2 years ago

Could it be somehow related to https://github.com/NLnetLabs/unbound/commit/956d7d4e44f15dd53241e3e15c60d91af79743e8 ?

wcawijngaards commented 2 years ago

No I do not think that commit is related.

I think it may be because the query_for_targets function checks for depth before a 0-return path. I would like to see the complete logs, though. That no ip6 stops it, fits with it being the issue.

renaudallard commented 2 years ago

Here is a log (not with syslog) with both a success and a failure for www.hubo.be unbound.log

wcawijngaards commented 2 years ago

The commit is an attempt to fix this problem. The logs from file are very complete, thank you. The fix attempts to check for target limits after it has determined that there is work to do. So that when ipv6 fails, and then it retries the 0x20 backoff procedure, the query target limits are not exceeded, because they have already been done, allowing the 0x20 backoff procedure to continue.

lohphat commented 2 years ago

FYI, this issue has cropped-up in a pfSense community support when several installations were reporting of slow/broken DNS lookups since it's using unbound 1.15.0 as part of the base FreeBSD install for the pfSense 22.05 release. So for us it's an upstream component problem. If you'd like to review the findings, logs, and triage process, here's the discussion thread link: https://forum.netgate.com/topic/173148/slow-dns-after-22-05

NLnetLabs / unbound

SERVFAIL problems with unbound 1.15.0 running on OpenBSD 7.1 #670