After router reboot, all requests fail with SERVFAIL until Unbound is restarted

chiak597 commented 4 years ago

I am using Unbound 1.10.1 on Debian as a simple forwarding server with one forward zone and two DoT forward servers. When my router is rebooted (for example due to power glitch), all subsequent requests are failing with SERVFAIL, even when the internet connectivity is up again:

Aug 03 17:17:19 unbound[289573:0] error: SERVFAIL <realtime.services.disqus.com. AAAA IN>: all the configured stub or forward servers failed, at zone .
Aug 03 17:17:20 unbound[289573:0] error: SERVFAIL <realtime.services.disqus.com. A IN>: all the configured stub or forward servers failed, at zone .
Aug 03 17:17:30 unbound[289573:0] error: SERVFAIL <www.wunderground.com. A IN>: all the configured stub or forward servers failed, at zone .
Aug 03 17:17:30 unbound[289573:0] error: SERVFAIL <www.wunderground.com. AAAA IN>: all the configured stub or forward servers failed, at zone .
Aug 03 17:17:31 unbound[289573:0] error: SERVFAIL <api.mapbox.com. A IN>: all the configured stub or forward servers failed, at zone .
Aug 03 17:17:31 unbound[289573:0] error: SERVFAIL <api.mapbox.com. AAAA IN>: all the configured stub or forward servers failed, at zone .

At the same time, I can see no connections to the forward servers:

# netstat -tanpu | grep 853
#

Unbound restart is required in order to get it working again.

Version:

$ /usr/sbin/unbound -V
Version 1.10.1

Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --disable-rpath --with-pidfile=/run/unbound.pid --with-rootkey-file=/var/lib/unbound/root.key --with-libevent --with-pythonmodule --enable-subnet --enable-dnstap --enable-systemd --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --libdir=/usr/lib
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 1.1.1g  21 Apr 2020
Linked modules: dns64 python subnetcache respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

wcawijngaards commented 4 years ago

Looks like this is because the upstream does not respond, after a reboot. Is there some sort of slow firewall or connection setup?

In any case, you can get more information by setting unbounds verbosity higher, with the verbosity option. To about 4. Then unbound logs every query and response from the upstream. And likely then logs timeouts. But it could be some other failure (eg. cannot create socket?). So it could be useful to look at that.

Since the logs you show say the upstream fails to respond, unbound must have timed out the servers. After some time, unbound probes again to the upstreams and if they respond again then it can work. For this unbound needs some client queries. But it is on a timer, and it is not really fast because otherwise unbound would bother servers that are down. There is exponential backoff, from milliseconds and then it ends up at minutes or even 15 minutes or more.

You could change the infra-host-ttl: 900 (seconds) setting that determines how long unbound caches the reachability and rtt ping time of the upstream servers.

If there is another process, like router negotiation, you could unbound-control flush_infra all that wipes the reachability information, once the connectivity set up process is complete.

chiak597 commented 4 years ago

One failed request (internet connection was already up at that time): failed-request.log

I have changed infra-host-ttl to 60 and it helped a lot. Thanks for the hint.

wcawijngaards commented 4 years ago

The logs indicate that the servers have not responded for a long time and are now blacklisted in the infra cache from being used. And then it serves servfail.

But you say that service has resumed again. Reducing that ttl makes unbound check faster if service is available again. Good to see that this helps.

So not just unbound fails to do lookups, but the machine cannot connect to the destination IP addresses, apparently for long periods.

chiak597 commented 4 years ago

The logs indicate that the servers have not responded for a long time and are now blacklisted in the infra cache from being used. And then it serves servfail.

Ok, this makes sense (as full router reboot takes usually 3-5 minutes). Thank you very much.

NLnetLabs / unbound

After router reboot, all requests fail with SERVFAIL until Unbound is restarted #282