Closed chiak597 closed 4 years ago
It looks like this happens because the upstream does not respond after the reboot. Is there some sort of slow firewall or connection setup?
In any case, you can get more information by raising unbound's verbosity with the verbosity option, to about 4. Unbound then logs every query and response from the upstream, and will likely log the timeouts too. But it could be some other failure (e.g. cannot create socket?), so it would be useful to look at those logs.
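As a rough sketch of the verbosity suggestion above: besides setting verbosity: 4 in the server: section of unbound.conf and restarting, the level can be changed on a running daemon with unbound-control verbosity. The helper below assumes remote control is enabled (control-enable: yes in the remote-control: section); the function name set_verbosity and the UNBOUND_CONTROL override variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: change the log verbosity of a running unbound.
# Assumes remote-control is enabled (control-enable: yes).
# UNBOUND_CONTROL is a hypothetical override, handy for dry runs.
UNBOUND_CONTROL="${UNBOUND_CONTROL:-unbound-control}"

# set_verbosity LEVEL
# Rejects anything outside 0-5, then asks the daemon to switch level.
set_verbosity() {
    level="$1"
    case "$level" in
        [0-5]) ;;
        *)
            echo "verbosity must be 0-5, got: $level" >&2
            return 2
            ;;
    esac
    "$UNBOUND_CONTROL" verbosity "$level"
}
```

Calling set_verbosity 4 before rebooting the router and set_verbosity 1 afterwards keeps the noisy logging limited to the window of interest. A runtime change like this is not written back to the config file.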
Since the logs you show say the upstream fails to respond, unbound must have timed out the servers. After some time, unbound probes the upstreams again, and if they respond it can work again. For this unbound needs some client queries. The probing is on a timer, and it is not really fast, because otherwise unbound would bother servers that are down. There is exponential backoff, starting at milliseconds and ending up at minutes, or even 15 minutes or more.
You could change the infra-host-ttl: 900 (seconds) setting, which determines how long unbound caches the reachability and RTT (ping time) of the upstream servers.
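A minimal sketch of how the infra-host-ttl change could be applied, assuming a Debian-style layout where the main config includes /etc/unbound/unbound.conf.d/*.conf. That path, the file name in the usage note and the function name infra_ttl_fragment are assumptions for illustration; the option itself belongs in the server: section.

```shell
#!/bin/sh
# Sketch only: print an unbound.conf fragment that sets infra-host-ttl.
# The function name is hypothetical; only the option name comes from unbound.

# infra_ttl_fragment SECONDS
# Prints a server: clause with the given TTL; rejects non-numeric input.
infra_ttl_fragment() {
    ttl="$1"
    case "$ttl" in
        ''|*[!0-9]*)
            echo "ttl must be a non-negative integer, got: $ttl" >&2
            return 2
            ;;
    esac
    printf 'server:\n    infra-host-ttl: %s\n' "$ttl"
}
```

One way to use it: infra_ttl_fragment 60 | sudo tee /etc/unbound/unbound.conf.d/infra-ttl.conf, then run unbound-checkconf and restart unbound. A lower value makes unbound retry unreachable upstreams sooner, at the cost of probing servers that are really down more often.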
If there is another process, like router negotiation, you could run unbound-control flush_infra all once the connectivity setup process is complete; that wipes the reachability information.
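The flush could be automated roughly as follows. This is a sketch under assumptions, not something taken from the thread: the function name flush_infra_when_up, the UNBOUND_CONTROL override and the default retry numbers are invented for the example, and the caller supplies whatever connectivity check suits the network. Remote control must be enabled for unbound-control to work.

```shell
#!/bin/sh
# Sketch only: wait until a connectivity check passes, then wipe
# unbound's infra cache so upstreams marked as down are retried.
# UNBOUND_CONTROL is a hypothetical override, handy for dry runs.
UNBOUND_CONTROL="${UNBOUND_CONTROL:-unbound-control}"

# flush_infra_when_up CHECK_CMD [TRIES] [DELAY_SECONDS]
# Runs CHECK_CMD up to TRIES times, DELAY_SECONDS apart. On the first
# success it runs "unbound-control flush_infra all" and returns that
# command's exit status. Returns 1 if the check never succeeds.
flush_infra_when_up() {
    check_cmd="$1"
    tries="${2:-30}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if sh -c "$check_cmd" >/dev/null 2>&1; then
            "$UNBOUND_CONTROL" flush_infra all
            return $?
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "upstream still unreachable after $tries checks" >&2
    return 1
}
```

For example, flush_infra_when_up 'ping -c 1 -W 2 192.0.2.1' could be run from a hook that fires when the link comes back up, with 192.0.2.1 replaced by the address of one of the forward servers (the address here is a documentation placeholder).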
One failed request (internet connection was already up at that time): failed-request.log
I have changed infra-host-ttl to 60 and it helped a lot. Thanks for the hint.
The logs indicate that the servers have not responded for a long time and are now blacklisted in the infra cache, so they are no longer used. Unbound then serves SERVFAIL.
But you say that service has since resumed. Reducing that TTL makes unbound check sooner whether service is available again. Good to see that this helps.
So it is not just that unbound fails to do lookups; the machine apparently cannot connect to the destination IP addresses for long periods.
OK, this makes sense (a full router reboot usually takes 3-5 minutes). Thank you very much.
I am using Unbound 1.10.1 on Debian as a simple forwarding server with one forward zone and two DoT forward servers. When my router is rebooted (for example due to a power glitch), all subsequent requests fail with SERVFAIL, even after internet connectivity is back up:
At the same time, I can see no connections to the forward servers:
An Unbound restart is required to get it working again.
Version: