NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

When used with systemd-networkd, unbound does not start until systemd-networkd-wait-online.service times out #773

Closed dryya closed 1 year ago

dryya commented 1 year ago

Describe the bug

As described in this Arch Linux bug report, "unbound waits for the network to be on (as stipulated in its service file) and systemd waits for the DNS resolver to be up before declaring that the network is on. The cycle only breaks when systemd network initialization times out and finally the unbound service file is allowed to start." The behavior started with commit afbc7bb4fec5026f6a1a1487e643b94b2ba1d694. Unbound and the network still work perfectly fine afterwards; DNS resolution just doesn't come up until the timeout period for systemd's network target has passed.

To reproduce

On Arch Linux, enable the systemd-networkd and unbound systemd services. systemd-resolved is disabled. I don't believe it's relevant, but I included a minimal resolvconf config file too.

/etc/unbound/unbound.conf
server:
    verbosity: 1
    trust-anchor-file: "/etc/unbound/trusted-key.key"
    tls-cert-bundle: "/etc/ssl/cert.pem"
    tls-system-cert: yes
python:
dynlib:
remote-control:
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 1.1.1.1@853#cloudflare-dns.com
/etc/systemd/network/20-wired.network 
[Match]
Name=enp31s0
[Network]
DHCP=yes
[DHCPv4]
UseDNS=no
[DHCPv6]
UseDNS=no
/etc/resolvconf.conf
name_servers="::1 127.0.0.1"
resolv_conf_options="trust-ad"

Some more information on what's happening via systemd logs:

Output from systemctl status systemd-networkd-wait-online.service:

× systemd-networkd-wait-online.service - Wait for Network to be Configured
     Loaded: loaded (/usr/lib/systemd/system/systemd-networkd-wait-online.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/systemd-networkd-wait-online.service.d
             └─override.conf
     Active: failed (Result: exit-code) since Sat 2022-10-29 22:49:12 CDT; 13min ago
       Docs: man:systemd-networkd-wait-online.service(8)
    Process: 621 ExecStart=/usr/lib/systemd/systemd-networkd-wait-online (code=exited, status=1/FAILURE)
   Main PID: 621 (code=exited, status=1/FAILURE)
        CPU: 9ms

22:47:12 arch systemd[1]: Starting Wait for Network to be Configured...
22:49:12 arch systemd-networkd-wait-online[621]: Timeout occurred while waiting for network connectivity.
22:49:12 arch systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
22:49:12 arch systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
22:49:12 arch systemd[1]: Failed to start Wait for Network to be Configured.

And you can see via journalctl --boot unbound only begins afterwards:

Oct 29 22:49:12 arch systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Oct 29 22:49:12 arch systemd[1]: Failed to start Wait for Network to be Configured.
Oct 29 22:49:12 arch systemd[1]: Reached target Network is Online.
Oct 29 22:49:12 arch systemd[1]: Starting Validating, recursive, and caching DNS resolver...
Oct 29 22:49:12 arch unbound[1432]: [1432:0] notice: init module 0: subnetcache

System:

Configure line: --prefix=/usr --sysconfdir=/etc --localstatedir=/var --sbindir=/usr/bin --disable-rpath --enable-dnscrypt --enable-dnstap --enable-pie --enable-relro-now --enable-subnet --enable-systemd --enable-tfo-client --enable-tfo-server --enable-cachedb --with-libhiredis --with-conf-file=/etc/unbound/unbound.conf --with-pidfile=/run/unbound.pid --with-rootkey-file=/etc/trusted-key.key --with-libevent --with-libnghttp2 --with-pyunbound
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 1.1.1q 5 Jul 2022
Linked modules: dns64 cachedb subnetcache respip validator iterator
DNSCrypt feature available
TCP Fastopen feature available

BSD licensed, see LICENSE in source package for details. Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

wcawijngaards commented 1 year ago

There seems to be a loop in the service file: Wants= references the same targets that the ordering lines refer to, both network-online.target and nss-lookup.target. Perhaps the sensible approach is to spell out the intended ordering: unbound starts once network.target is done, and it finishes starting before network-online.target is reached. It is also ordered before nss-lookup.target, so that unbound is up before anything that waits on nss-lookup starts to do queries.

This sort of depends on the meaning of the targets and also other systemd set up. Perhaps this change could be good?

diff --git a/contrib/unbound.service.in b/contrib/unbound.service.in
index ada5fac9..5a05c525 100644
--- a/contrib/unbound.service.in
+++ b/contrib/unbound.service.in
@@ -42,9 +42,8 @@
 [Unit]
 Description=Validating, recursive, and caching DNS resolver
 Documentation=man:unbound(8)
-After=network-online.target
-Before=nss-lookup.target
-Wants=network-online.target nss-lookup.target
+After=network.target
+Before=network-online.target nss-lookup.target

 [Install]
 WantedBy=multi-user.target
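
To make the effect of the ordering lines in this diff easier to see, here is a minimal Python sketch. It is a simplified model that only looks at ordering edges, not systemd's actual job engine. The `start_order` helper and the two dictionaries are hypothetical names made up for illustration; the edge from systemd-networkd-wait-online.service to network-online.target is taken from the journal output above, where the target is only reached once that service has finished.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key maps to the set of units that must finish starting before it.
# Simplified model of ordering only (After=/Before=); hypothetical names.
original = {
    "network-online.target": {"systemd-networkd-wait-online.service"},
    "unbound.service": {"network-online.target"},    # After=network-online.target
    "nss-lookup.target": {"unbound.service"},        # Before=nss-lookup.target
}

patched = {
    "unbound.service": {"network.target"},           # After=network.target
    "network-online.target": {
        "systemd-networkd-wait-online.service",
        "unbound.service",                           # Before=network-online.target
    },
    "nss-lookup.target": {"unbound.service"},        # Before=nss-lookup.target
}


def start_order(graph):
    """Return one valid start order for the given ordering edges."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print("original:", start_order(original))
    print("patched: ", start_order(patched))
```

In the original graph, unbound.service can only appear after systemd-networkd-wait-online.service, so a timeout there delays the resolver. In the patched graph there is no edge between the two, so unbound no longer has to wait for it.
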

dryya commented 1 year ago

I can confirm that this works for me on two machines (one using systemd-networkd and one with no network manager, just iwd) - unbound is up and running in three seconds! (I attempted something similar on my own, but I realize now it failed because the standard systemctl edit command won't remove previous Before entries, but instead adds on to them.) Thanks for the quick response!

jm355 commented 1 year ago

That fixed it for me as well!

wcawijngaards commented 1 year ago

The fix is committed to the repo. That should improve the systemd integration scripts for Unbound!

hugleo commented 1 year ago

This may be related to this commit: when I restart the server, the unbound service fails because the IPv6 network still hasn't come up.

unbound[364]: [1673554420] unbound[364:0] error: can't bind socket: Cannot assign requested address for 2001:db8:0:2::2 port 53
unbound[364]: [1673554420] unbound[364:0] fatal error: could not open ports
systemd[1]: unbound.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: unbound.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Validating, recursive, and caching DNS resolver.

I need to restart the service to bring it up: systemctl restart unbound.service

/etc/systemd/network/ens18.network
[Match]
Name=ens18

[Address]
Address=192.168.0.2/24

[Address]
Address=2001:db8:0:2::2/64

[Network]
Gateway=192.168.0.1
Gateway=2001:db8:0:2::1
DHCP=no
ConfigureWithoutCarrier=Yes

ztNIE commented 2 months ago

Hi @wcawijngaards,

I've encountered an issue where the Unbound service fails to restart on boot, which may be related to the issue you've addressed.

TL;DR: After=network.target doesn't guarantee that interfaces are ready when Unbound attempts to bind to them. Changing the configuration to After=network-online.target appears to be the correct fix.

Details: I have a custom dummy interface with IP 10.1.1.1, and Unbound cannot bind to it during boot time because the interface isn't ready yet. I fixed this issue by modifying the unit file (I'm using Unbound 1.rocky8 and Unbound 1.16.2) to this:

[Unit]
Description=Unbound recursive Domain Name Server
After=network.target
# This is the line I added
After=network-online.target
Before=nss-lookup.target

Before I changed the unit file (After=network.target), unbound could not start at boot time:

Jul 08 15:07:19 sre-pdns-primary systemd[1]: Starting Unbound recursive Domain Name Server...
Jul 08 15:07:19 sre-pdns-primary unbound-checkconf[844]: unbound-checkconf: no errors in /etc/unbound/unbound.conf
Jul 08 15:07:19 sre-pdns-primary systemd[1]: Started Unbound recursive Domain Name Server.
Jul 08 15:07:19 sre-pdns-primary unbound[855]: [1720415239] unbound[855:0] error: can't bind socket: Cannot assign requested address for 10.1.1.1 port 53
Jul 08 15:07:19 sre-pdns-primary unbound[855]: [1720415239] unbound[855:0] fatal error: could not open ports
Jul 08 15:07:19 sre-pdns-primary systemd[1]: unbound.service: Main process exited, code=exited, status=1/FAILURE
Jul 08 15:07:19 sre-pdns-primary systemd[1]: unbound.service: Failed with result 'exit-code'.

After I changed the unit file (After=network-online.target), unbound starts normally:

Jul 08 15:10:11 sre-pdns-primary systemd[1]: Starting Unbound recursive Domain Name Server...
Jul 08 15:10:11 sre-pdns-primary unbound-checkconf[2484]: unbound-checkconf: no errors in /etc/unbound/unbound.conf
Jul 08 15:10:11 sre-pdns-primary systemd[1]: Started Unbound recursive Domain Name Server.
Jul 08 15:10:11 sre-pdns-primary unbound[2489]: [1720415411] unbound[2489:0] debug: chdir to /etc/unbound
Jul 08 15:10:11 sre-pdns-primary unbound[2489]: [1720415411] unbound[2489:0] debug: drop user privileges, run as unbound
Jul 08 15:10:11 sre-pdns-primary unbound[2489]: [1720415411] unbound[2489:0] debug: switching log to /var/log/unbound/unbound.log

According to RHEL's documentation, network.target means that the service for setting up the network has started but doesn't guarantee that it's ready. In contrast, network-online.target is only reached after the network is connected, which seems to be the appropriate option for this use case.

In most cases, the current setting works because the interfaces come up before Unbound tries to bind to them. However, interfaces can be slow to come up, in which case Unbound fails to start at boot time. Many users modify their own systemd unit file to fix this (it's more likely to happen with custom interfaces). Changing After=network.target to After=network-online.target may address the root cause of this issue.

hugleo commented 2 months ago

I'm not facing the problem for IPv4. For IPv6, the root cause seems to be DAD (Duplicate Address Detection). A workaround is to disable it with: net.ipv6.conf.xxx.accept_dad = 0
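
One way to check whether DAD is involved is to look for addresses that the kernel still marks as tentative, since binding to a tentative address typically fails with "Cannot assign requested address". The following is a minimal Python sketch, assuming the usual six-column layout of /proc/net/if_inet6 on Linux (address, interface index, prefix length, scope, flags, device name). The `tentative_addresses` helper and the sample text are made up for illustration and are not output from a real system.

```python
import ipaddress

IFA_F_TENTATIVE = 0x40  # kernel flag: address is still undergoing DAD


def tentative_addresses(if_inet6_text):
    """Return (device, address) pairs whose flags include IFA_F_TENTATIVE."""
    result = []
    for line in if_inet6_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip anything that does not match the assumed layout
        addr_hex, _ifindex, _prefixlen, _scope, flags_hex, dev = fields
        if int(flags_hex, 16) & IFA_F_TENTATIVE:
            addr = ipaddress.IPv6Address(int(addr_hex, 16))
            result.append((dev, str(addr)))
    return result


if __name__ == "__main__":
    # Made-up sample in the assumed /proc/net/if_inet6 layout.
    sample = (
        "00000000000000000000000000000001 01 80 10 80       lo\n"
        "20010db8000000020000000000000002 02 40 00 c0    ens18\n"
    )
    print(tentative_addresses(sample))
```

On a live system the same function could be fed the contents of /proc/net/if_inet6 shortly after boot; an empty result would suggest that DAD has already completed.
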

wcawijngaards commented 2 months ago

The commit https://github.com/NLnetLabs/unbound/commit/d43760a8cd7d01f59fd73bf7edbf983903d8a142 adds network-online.target to the contrib/unbound.service.in and contrib/unbound_portable.service.in unit files. Another workaround is to set ip-freebind: yes, which allows binding to interfaces that are down, or ip-transparent: yes.
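
For readers unfamiliar with what ip-freebind changes at the socket level, here is a minimal Python sketch of the underlying Linux IP_FREEBIND socket option. It is not Unbound's code. The `can_bind` helper is a hypothetical name, 192.0.2.1 is a documentation address assumed not to be configured on the host, and the fallback value 15 for IP_FREEBIND is taken from the Linux headers because some Python builds may not export the constant.

```python
import socket

# IP_FREEBIND is Linux-specific; fall back to its numeric value from
# <linux/in.h> if this Python build does not export the constant.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def can_bind(address, freebind=False):
    """Try to bind a UDP socket to (address, ephemeral port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            if freebind:
                # Ask the kernel to allow binding to a non-local address.
                sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
            sock.bind((address, 0))
            return True
        except OSError:
            return False


if __name__ == "__main__":
    print("loopback:              ", can_bind("127.0.0.1"))
    print("missing address:       ", can_bind("192.0.2.1"))
    print("missing, with freebind:", can_bind("192.0.2.1", freebind=True))
```

On a typical Linux host the second call is expected to fail with the same "Cannot assign requested address" error seen in the logs above, while the third is expected to succeed. Results may differ in containers or on other operating systems.
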