NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License
3.14k stars 359 forks source link

fatal error: could not open autotrust file for writing, /var/lib/unbound/root.key.2602864-1-560c3a7457d0: Permission denied #447

Closed SomePersonSomeWhereInTheWorld closed 2 years ago

SomePersonSomeWhereInTheWorld commented 3 years ago

Any reason why this started to happen?

fatal error: could not open autotrust file for writing, /var/lib/unbound/root.key.2602864-1-560c3a7457d0: Permission denied

unbound-libs-1.13.1-1.fc33.x86_64
python3-unbound-1.13.1-1.fc33.x86_64
unbound-1.13.1-1.fc33.x86_64
wcawijngaards commented 3 years ago

Unbound periodically writes to the root.key file, to process the updates for RFC5011 updates. That is what you have configured.

The error is that unbound wants to create a temporary file to write the new contents of the /var/lib/unbound/root.key into. After that unbound is going to remove the old root.key file and rename this file to the root.key file. But unbound gets permission denied when it attempts to write there. Perhaps the directory has no write permissions (for the unbound user), or you have selinux or apparmor or freebsd capabilities configured to deny write access from unbound to this directory?

SomePersonSomeWhereInTheWorld commented 3 years ago

I try to chown unbound:unbound but after restarting the unbound process it goes back to root;

ls -l /var/lib/unbound/root.key
-rw-r--r-- 1 root root 759 Mar 19 10:24 /var/lib/unbound/root.key

No SELinux, no AppArmor, and you can see from my rpm command it's Fedora 33.

Now I see:

fatal error: could not open autotrust file for writing, /var/lib/unbound/root.key.2653680-3-56517c61b7d0: Permission denied

SomePersonSomeWhereInTheWorld commented 3 years ago

Not sure if this is the long term fix but I ranchown -R unbound:unbound /var/lib/unbound and restarted unbound but I get these error: read (in tcp s): Connection refused as mentioned in issue 360.

Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 199.180.182.53 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 196.216.169.10 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 196.216.169.10 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 193.0.9.1 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 200.10.60.53 port 53
Mar 19 12:15:52 ourserver unbound[2660803]: [2660803:3] error: read (in tcp s): Connection refused for 200.10.60.53 port 53

TCP Fast Open is disabled. How can I debug this?

dig +tcp @193.0.9.1
;; Connection to 193.0.9.1#53(193.0.9.1) for . failed: connection refused.

Edit I see a user having this with Netgate.

wcawijngaards commented 3 years ago

The chown of the directory worked, and that fixed your permission denied failure that you had.

The connection refused, is a spammy message that says unbound cannot make tcp connections. I have made it less visible, by only making it print when verbosity is high. This is committed to the repository.

It seems you cannot make TCP connections, and dig also cannot make tcp connections to those IP addresses from the commandline. I also do not know why the connections are refused by your upstream. Perhaps your firewall is refusing to allow tcp connections there, or another firewall is refusing tcp connections. This is a separate issue from your permission denied issue.

SomePersonSomeWhereInTheWorld commented 3 years ago

Once I removed the -doption from /usr/lib/systemd/system/unbound.service in ExecStartthe TCP timeout issues also disappeared from the logs. Perhaps this was already done upstream?

unbound-libs-1.13.1-1.fc33.x86_64
python3-unbound-1.13.1-1.fc33.x86_64
unbound-1.13.1-1.fc33.x86_64

Anyways, as for the original issue, yes it's fixed. I haven't tried a reboot but I'm assuming it should be persistent. But I don't see this documented anywhere to change the permission in the/var/lib/unbounddirectory. Should that be part of the installation process?

wcawijngaards commented 3 years ago

Not sure how the removal of the -d option fixes your issue, perhaps what happened was you reloaded your new config file in the process with fixed options in it? Or it fixed it somehow, the unbound.service file is part of the unbound rpm that you are using.

Also the var/lib/unbound directory is created by the rpm and also the configuration that unbound uses that directory to store files is part of the rpm configure script (or default configuration shipped with the package). But we do not do the package scripts, that is handled by the distro that you are using. A look at the unbound.spec file shows that it performs a chown for the user unbound on install %dir %attr(0755,unbound,unbound) %{_sharedstatedir}/%{name}, so I do not know what is wrong with that.

SomePersonSomeWhereInTheWorld commented 3 years ago

Not sure how the removal of the -d option fixes your issue, perhaps what happened was you reloaded your new config file in the process with fixed options in it? Or it fixed it somehow, the unbound.service file is part of the unbound rpm that you are using.

I can confirm that on a second server running unbound, that when I removed the -d option from the /usr/lib/systemd/system/unbound.service all of thoseerror: read (in tcp s): Connection refused errors disappeared.

wcawijngaards commented 3 years ago

The -d option, documented with -d do not fork into the background., makes unbound not fork into a new process. So with -d it stays attached to the console on which it was started. This is what systemd wants because systemd is monitoring that console.

Systemd may also have set other permissions on that console, did it? Would that have caused the different results for tcp connections for the program? Without the -d option unbound forks into a process, but I do not really see what difference that could have made to the tcp connections, so that it gives no connection refused.

SomePersonSomeWhereInTheWorld commented 3 years ago

Systemd may also have set other permissions on that console, did it?

Can you clarify what you are looking for? Here is what is in the /usr/lib/systemd/system/unbound.service file.

[Unit]
Description=Unbound recursive Domain Name Server
After=network.target
After=unbound-keygen.service
Wants=unbound-keygen.service
Wants=unbound-anchor.timer
Before=nss-lookup.target
Wants=nss-lookup.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/unbound
ExecStartPre=/usr/sbin/unbound-checkconf
ExecStartPre=-/usr/sbin/unbound-anchor -a /var/lib/unbound/root.key -c /etc/unbound/icannbundle.pem -f /etc/resolv.conf -R
ExecStart=/usr/sbin/unbound  $UNBOUND_OPTIONS
ExecReload=/usr/sbin/unbound-control reload

[Install]
WantedBy=multi-user.target
 /etc/sysconfig/unbound
# for extra debug, add "-v -v" or change verbosity: in unbound.conf

UNBOUND_OPTIONS=""
wcawijngaards commented 3 years ago

From this I see no reason why -d would have made a difference to TCP connections.

SomePersonSomeWhereInTheWorld commented 3 years ago

After I removed our DNS servers in /etc/resolv.conf unbound stops working:

unbound-checkconf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
# unbound-control dump_cache
[1617035678] unbound-control[3348867:0] error: connect: Connection refused for 127.0.0.1 port 8953

# systemctl status unbound
● unbound.service - Unbound recursive Domain Name Server
     Loaded: loaded (/usr/lib/systemd/system/unbound.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Mon 2021-03-29 12:31:33 EDT; 3min 12s ago
    Process: 3348322 ExecStartPre=/usr/sbin/unbound-checkconf (code=exited, status=0/SUCCESS)
    Process: 3348323 ExecStartPre=/usr/sbin/unbound-anchor -a /var/lib/unbound/root.key -c /etc/unbound/icannbundle.pem -f /etc/resolv.conf -R (code=exited, status=0/SUCCESS)
    Process: 3348373 ExecStart=/usr/sbin/unbound $UNBOUND_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 3348373 (code=exited, status=0/SUCCESS)
        CPU: 148ms

Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: service stopped (unbound 1.13.1).
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Mar 29 12:31:33 ourserver unbound[3348374]: [3348374:0] info: server

netstat -nltup | grep 'Proto\|:8953 '
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
unbound-control status
[1617035844] unbound-control[3349422:0] error: connect: Connection refused for 127.0.0.1 port 8953
unbound is stopped

Port 8953 is open. What else is wrong?

SomePersonSomeWhereInTheWorld commented 3 years ago

OK a little embarrassing here. The -doption is definitely needed. interface-automatic was set to the default, no.

        # NOTE: Enable this option when specifying interface 0.0.0.0 or ::0
        # NOTE: Disabled per Fedora policy not to listen to * on default install
        # NOTE: If deploying on non-default port, eg 80/443, this needs to be disabled
        interface-automatic: yes

I'm still getting theerror: read (in tcp s): Connection refused for errors but at least unboundis working.

dig +tcp @193.0.9.1
;; Connection to 193.0.9.1#53(193.0.9.1) for . failed: connection refused.
wcawijngaards commented 3 years ago

So dig fails to make a tcp connection. Unbound also fails to make that tcp connection. Your system does not make tcp connections.

Perhaps a firewall issue, on this device or on another device?

SomePersonSomeWhereInTheWorld commented 3 years ago

So dig fails to make a tcp connection. Unbound also fails to make that tcp connection. Your system does not make tcp connections.

Perhaps a firewall issue, on this device or on another device?

Indeed our IT group confirmed they started blocking all non-UDP requests to DNS. They are looking at allowing it back just for our department.

You previously mentioned TCP Fast Open. Does unbound take advantage of it and any place I can share documentation for it?

Also if TCP is blocked for IPs outside of our subnet, would that cause dnf update to fail with the following?

Error: Error downloading packages:
  Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f33&arch=x86_64 [Could not resolve host: mirrors.fedoraproject.org]
wcawijngaards commented 3 years ago

That should solve those tcp error numbers. Otherwise you could configure unbound to not allow it to use tcp any more. Those resolutions would still fail then, but unbound would not complain in logs.

Unbound can use TCP fast open. The feature is enabled at ./configure time, and ./configure --help shows output about it. unbound -V may then contain the configure line reference to that option.

Yes it could cause this failure, if that name happens to use a very large answer that does not fit in a UDP reply. DNS servers then try to change over to TCP. If that fails, it would cause unbound to log that failure and then also return a failure to the caller, and that could, I guess, result in a line like you quote, although that could also have other causes.

SomePersonSomeWhereInTheWorld commented 3 years ago

Yes it could cause this failure, if that name happens to use a very large answer that does not fit in a UDP reply. DNS servers then try to change over to TCP. If that fails, it would cause unbound to log that failure and then also return a failure to the caller, and that could, I guess, result in a line like you quote, although that could also have other causes.

Is there a way, e.g., with tcpdump, to verify this is what's happening?

wcawijngaards commented 3 years ago

Yes, for example set unbound to a high verbosity, like 4 or so, and then inspect the logs about that name. See if it got a truncated answer.

Or use dig, dig @ the ip address of the authority server for that domain name, and see if the answer is larger than about 1400-1500 bytes, or dig fails like unbound with tcp connection failure as it also retries the fetch it over TCP if it is large. This may also be a query of type DNSKEY for the zone name of that query, those are frequently large too. Dig has an option +ignore that stops it from following the truncation flag, it can then print a result with +TC flag in the header indicating that dig without ignore and also unbound would have retried with TCP.

SomePersonSomeWhereInTheWorld commented 3 years ago

Perhaps it's not TCP but DNSSEC that is expected? See logs.

Mar 31 09:42:36 unbound[3903467:1] info: sending query: fedoraproject.org. A IN
Mar 31 09:42:36 unbound[3903467:1] debug: sending to target: <.> 2001:500:a8::e#53
Mar 31 09:42:36 unbound[3903467:1] debug: dnssec status: expected
Mar 31 09:42:36 unbound[3903467:1] debug: EDNS lookup known=0 vs=0
Mar 31 09:42:36 unbound[3903467:1] debug: serviced query UDP timeout=376 msec
Mar 31 09:42:36 unbound[3903467:1] debug: inserted new pending reply id=ed4f
Mar 31 09:42:36 unbound[3903467:1] debug: opened UDP if=0 port=55718
Mar 31 09:42:36 unbound[3903467:1] error: udp connect failed: Network is unreachable for 2001:500:a8::e port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] info: error sending query to auth server ip6 2001:500:a8::e port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Mar 31 09:42:36 unbound[3903467:1] info: processQueryTargets: mirrors.fedoraproject.org. A IN
Mar 31 09:42:36 unbound[3903467:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
Mar 31 09:42:36 unbound[3903467:1] info: DelegationPoint<.>: 13 names (0 missing), 26 addrs (26 result, 0 avail) cacheNS
Mar 31 09:42:36 unbound[3903467:1] info:   f.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info: auth_zone . query fedoraproject.org. A, domain NULL notexact notexist, ce NULL, rrset NULL
Mar 31 09:42:36 unbound[3903467:1] debug: attempt to get extra 3 targets
Mar 31 09:42:36 unbound[3903467:1] debug: servselect ip4 192.5.5.241 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug:    rtt=163
Mar 31 09:42:36 unbound[3903467:1] debug:    ip6 2001:503:c27::2:30 port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip4 192.58.128.30 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip6 2001:7fd::1 port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip4 193.0.14.129 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip6 2001:500:12::d0d port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip4 192.112.36.4 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip6 2001:dc3::35 port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip4 202.12.27.33 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip6 2001:500:2f::f port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug:    ip4 192.5.5.241 port 53 (len 16)
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state INIT REQUEST STATE (stage 2)
Mar 31 09:42:36 unbound[3903467:1] info: resolving (init part 2):  mirrors.fedoraproject.org. AAAA IN
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state INIT REQUEST STATE (stage 3)
Mar 31 09:42:36 unbound[3903467:1] info: resolving (init part 3):  mirrors.fedoraproject.org. AAAA IN
Mar 31 09:42:36 unbound[3903467:1] info: schedule dnskey prefetch . DNSKEY IN
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Mar 31 09:42:36 unbound[3903467:1] info: processQueryTargets: mirrors.fedoraproject.org. AAAA IN
Mar 31 09:42:36 unbound[3903467:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
Mar 31 09:42:36 unbound[3903467:1] info: DelegationPoint<.>: empty
Mar 31 09:42:36 unbound[3903467:1] debug: removing 2 labels
Mar 31 09:42:36 unbound[3903467:1] info: auth_zone . query org. A, domain NULL notexact notexist, ce NULL, rrset NULL
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state INIT REQUEST STATE
Mar 31 09:42:36 unbound[3903467:1] info: resolving mirrors.fedoraproject.org. AAAA IN
Mar 31 09:42:36 unbound[3903467:1] debug: request has dependency depth of 0
Mar 31 09:42:36 unbound[3903467:1] debug: cache delegation returns delegpt
Mar 31 09:42:36 unbound[3903467:1] info: DelegationPoint<.>: 13 names (0 missing), 26 addrs (0 result, 26 avail) cacheNS
Mar 31 09:42:36 unbound[3903467:1] info: sending query: fedoraproject.org. A IN
Mar 31 09:42:36 unbound[3903467:1] debug: sending to target: <.> 2001:500:1::53#53
Mar 31 09:42:36 unbound[3903467:1] debug: dnssec status: expected
Mar 31 09:42:36 unbound[3903467:1] debug: EDNS lookup known=0 vs=0
Mar 31 09:42:36 unbound[3903467:1] debug: serviced query UDP timeout=376 msec
Mar 31 09:42:36 unbound[3903467:1] debug: inserted new pending reply id=32fb
Mar 31 09:42:36 unbound[3903467:1] debug: opened UDP if=0 port=56944
Mar 31 09:42:36 unbound[3903467:1] error: udp connect failed: Network is unreachable for 2001:500:1::53 port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] info: error sending query to auth server ip6 2001:500:1::53 port 53 (len 28)
Mar 31 09:42:36 unbound[3903467:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Mar 31 09:42:36 unbound[3903467:1] info: processQueryTargets: mirrors.fedoraproject.org. AAAA IN
Mar 31 09:42:36 unbound[3903467:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
Mar 31 09:42:36 unbound[3903467:1] info: DelegationPoint<.>: 13 names (0 missing), 26 addrs (26 result, 0 avail) cacheNS
Mar 31 09:42:36 unbound[3903467:1] info:   f.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   m.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   g.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   k.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   j.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   b.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   c.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   d.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   a.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   i.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   l.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   h.root-servers.net. * A AAAA
Mar 31 09:42:36 unbound[3903467:1] info:   e.root-servers.net. * A AAAA

and then:

Mar 31 10:12:01 unbound[3909274:3] info: validation failure <mirrors.fedoraproject.org. AAAA IN>: No DNSKEY record for key fedoraproject.org. while building chain of trust

logs from tcpdump -n -s 1500 udp and port 53 -v:

10:14:59.871494 IP (tos 0x0, ttl 64, id 36015, offset 0, flags [none], proto UDP (17), length 82)
    x.x.x.x.57451 > 85.236.55.10.domain: 21341% [1au] A? mirrors.fedoraproject.org. (54)
10:14:59.871670 IP (tos 0x0, ttl 64, id 38142, offset 0, flags [none], proto UDP (17), length 74)
    x.x.x.x.39429 > 38.145.60.13.domain: 51344% [1au] DNSKEY? fedoraproject.org. (46)
10:14:59.872004 IP (tos 0x0, ttl 64, id 38143, offset 0, flags [none], proto UDP (17), length 82)
    x.x.x.x.37014 > 38.145.60.13.domain: 29471% [1au] AAAA? mirrors.fedoraproject.org. (54)
10:14:59.904828 IP (tos 0x0, ttl 56, id 54889, offset 0, flags [none], proto UDP (17), length 74)
    38.145.60.13.domain > x.x.x.x.39429: 51344*-| 0/0/1 (46)
10:14:59.905754 IP (tos 0x0, ttl 56, id 54890, offset 0, flags [none], proto UDP (17), length 917)
    38.145.60.13.domain > x.x.x.x.37014: 29471*- 11/0/1 mirrors.fedoraproject.org. CNAME wildcard.fedoraproject.org., mirrors.fedoraproject.org. RRSIG, mirrors.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. AAAA 2605:bc80:3010:600:dead:beef:cafe:fed9, wildcard.fedoraproject.org. AAAA 2604:1580:fe00:0:dead:beef:cafe:fed1, wildcard.fedoraproject.org. AAAA 2620:52:3:1:dead:beef:cafe:fed7, wildcard.fedoraproject.org. AAAA 2605:bc80:3010:600:dead:beef:cafe:feda, wildcard.fedoraproject.org. AAAA 2610:28:3090:3001:dead:beef:cafe:fed3, wildcard.fedoraproject.org. AAAA 2620:52:3:1:dead:beef:cafe:fed6, wildcard.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. RRSIG (889)
10:14:59.906008 IP (tos 0x0, ttl 64, id 31976, offset 0, flags [none], proto UDP (17), length 83)
    x.x.x.x.49864 > 38.145.60.14.domain: 62491% [1au] A? wildcard.fedoraproject.org. (55)
10:14:59.906196 IP (tos 0x0, ttl 64, id 36016, offset 0, flags [none], proto UDP (17), length 74)
    x.x.x.x.53032 > 85.236.55.10.domain: 45398% [1au] DNSKEY? fedoraproject.org. (46)
10:14:59.917198 IP (tos 0x0, ttl 56, id 453, offset 0, flags [none], proto UDP (17), length 565)
    38.145.60.14.domain > x.x.x.x.49864: 62491*- 12/0/1 wildcard.fedoraproject.org. A 152.19.134.198, wildcard.fedoraproject.org. A 38.145.60.20, wildcard.fedoraproject.org. A 8.43.85.67, wildcard.fedoraproject.org. A 8.43.85.73, wildcard.fedoraproject.org. A 67.219.144.68, wildcard.fedoraproject.org. A 152.19.134.142, wildcard.fedoraproject.org. A 209.132.190.2, wildcard.fedoraproject.org. A 140.211.169.196, wildcard.fedoraproject.org. A 140.211.169.206, wildcard.fedoraproject.org. A 38.145.60.21, wildcard.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. RRSIG (537)
10:14:59.917439 IP (tos 0x0, ttl 64, id 38157, offset 0, flags [none], proto UDP (17), length 83)
    x.x.x.x.41571 > 38.145.60.13.domain: 38290% [1au] AAAA? wildcard.fedoraproject.org. (55)
10:14:59.926376 IP (tos 0x0, ttl 56, id 54909, offset 0, flags [none], proto UDP (17), length 573)
    38.145.60.13.domain > x.x.x.x.41571: 38290*- 8/0/1 wildcard.fedoraproject.org. AAAA 2620:52:3:1:dead:beef:cafe:fed7, wildcard.fedoraproject.org. AAAA 2605:bc80:3010:600:dead:beef:cafe:fed9, wildcard.fedoraproject.org. AAAA 2604:1580:fe00:0:dead:beef:cafe:fed1, wildcard.fedoraproject.org. AAAA 2605:bc80:3010:600:dead:beef:cafe:feda, wildcard.fedoraproject.org. AAAA 2620:52:3:1:dead:beef:cafe:fed6, wildcard.fedoraproject.org. AAAA 2610:28:3090:3001:dead:beef:cafe:fed3, wildcard.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. RRSIG (545)
10:14:59.959706 IP (tos 0x0, ttl 56, id 43304, offset 0, flags [none], proto UDP (17), length 909)
    85.236.55.10.domain > x.x.x.x.57451: 21341*- 15/0/1 mirrors.fedoraproject.org. CNAME wildcard.fedoraproject.org., mirrors.fedoraproject.org. RRSIG, mirrors.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. A 67.219.144.68, wildcard.fedoraproject.org. A 38.145.60.20, wildcard.fedoraproject.org. A 140.211.169.196, wildcard.fedoraproject.org. A 38.145.60.21, wildcard.fedoraproject.org. A 209.132.190.2, wildcard.fedoraproject.org. A 8.43.85.73, wildcard.fedoraproject.org. A 8.43.85.67, wildcard.fedoraproject.org. A 152.19.134.198, wildcard.fedoraproject.org. A 152.19.134.142, wildcard.fedoraproject.org. A 140.211.169.206, wildcard.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. RRSIG (881)
10:14:59.960064 IP (tos 0x0, ttl 64, id 32004, offset 0, flags [none], proto UDP (17), length 83)
    x.x.x.x.60137 > 38.145.60.14.domain: 46187% [1au] A? wildcard.fedoraproject.org. (55)
10:14:59.968029 IP (tos 0x0, ttl 55, id 465, offset 0, flags [none], proto UDP (17), length 565)
    38.145.60.14.domain > x.x.x.x.60137: 46187*- 12/0/1 wildcard.fedoraproject.org. A 38.145.60.21, wildcard.fedoraproject.org. A 8.43.85.67, wildcard.fedoraproject.org. A 38.145.60.20, wildcard.fedoraproject.org. A 140.211.169.196, wildcard.fedoraproject.org. A 140.211.169.206, wildcard.fedoraproject.org. A 152.19.134.198, wildcard.fedoraproject.org. A 152.19.134.142, wildcard.fedoraproject.org. A 209.132.190.2, wildcard.fedoraproject.org. A 8.43.85.73, wildcard.fedoraproject.org. A 67.219.144.68, wildcard.fedoraproject.org. RRSIG, wildcard.fedoraproject.org. RRSIG (537)
10:14:59.994171 IP (tos 0x0, ttl 56, id 43315, offset 0, flags [none], proto UDP (17), length 74)
    85.236.55.10.domain > x.x.x.x.53032: 45398*-| 0/0/1 (46)
10:14:59.995502 IP (tos 0x0, ttl 64, id 63925, offset 0, flags [none], proto UDP (17), length 74)
    x.x.x.x.60014 > 152.19.134.139.domain: 34916% [1au] DNSKEY? fedoraproject.org. (46)

Edit: DNSSEC test:

dig +dnssec test.dnssec-or-not.net TXT

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> +dnssec test.dnssec-or-not.net TXT
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43955
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;test.dnssec-or-not.net.                IN      TXT

;; AUTHORITY SECTION:
dnssec-or-not.net.      600     IN      SOA     jm1.dns.com. admin.dns.com. 1611470736 7200 3600 1209600 1800

;; Query time: 736 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Mar 31 10:22:52 EDT 2021
;; MSG SIZE  rcvd: 104
dig com. SOA +dnssec

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> com. SOA +dnssec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13642
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;com.                           IN      SOA

;; ANSWER SECTION:
com.                    900     IN      SOA     a.gtld-servers.net. nstld.verisign-grs.com. 1617200653 1800 900 604800 86400
com.                    900     IN      RRSIG   SOA 8 1 900 20210407142413 20210331131413 58540 com. RM7D7VuV6RHpGQo59FtsfvUbs290E2nJvRUKcJ4dTlt8V90c3wUzJx65 5qJfc9Feol3iq9DQ5S6Qmbu8Mfh2nS71mfy6WBLsVkf5k7Qx7KBOcSmf 4kkWWOdHMWA0jiRKhNmjwrU4Db9fBbObqvxDWxWWiV3z7IuWazuR6phV fSeP0KAMNp1/siGaFz3l/ETd7gOS+PqbAc2jeRmeteyr5Q==

;; Query time: 19 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Mar 31 10:24:34 EDT 2021
;; MSG SIZE  rcvd: 300
wcawijngaards commented 3 years ago

The log snippet you show only shows the lack of IPv6 network, and that is not a concern here. You need to dig into the logs to see where the lookups fail (for other reasons than IPv6 network not connected).

My guess is that the fedoraproject.org DNSKEY query is failing. It asks three times in the snippet you give. And it does not give answers. Can you try that with dig commands? Perhaps you have PMTU problems, the packet is too big to fit in an UDP reply to you, such large packets are not allowed or possible near your machine on the network. And this causes TCP fallback, to fetch it over TCP if it manages to signal that it is too big, so it could be the issue.

SomePersonSomeWhereInTheWorld commented 3 years ago

OK I think we are getting somewhere. I posted a debug log for the duration of the failing dnf here. Note when I set val-permissive-mode to yes dnf update works.

building chain of trust
Mar 31 11:33:53 unbound[3925918:1] debug: mesh_run: validator module exit state is module_finished
Mar 31 11:33:53 unbound[3925918:1] debug: ipsecmod[module 0] operate: extstate:module_state_initial event:module_event_moddone
Mar 31 11:33:53 unbound[3925918:1] debug: mesh_run: ipsecmod module exit state is module_finished
Mar 31 11:33:53 unbound[3925918:1] debug: validator[module 1] operate: extstate:module_wait_subquery event:module_event_pass
Mar 31 11:33:53 unbound[3925918:1] info: validator operate: query mirrors.fedoraproject.org. AAAA IN
Mar 31 11:33:53 unbound[3925918:1] debug: val handle processing q with state VAL_VALIDATE_STATE
Mar 31 11:33:53 unbound[3925918:1] info: Could not establish a chain of trust to keys for fedoraproject.org. DNSKEY IN
Mar 31 11:33:53 unbound[3925918:1] debug: val handle processing q with state VAL_FINISHED_STATE
Mar 31 11:33:53 unbound[3925918:1] info: validation failure <mirrors.fedoraproject.org. AAAA IN>: No DNSKEY record for key fedoraproject.org. while building chain of trust
Mar 31 11:33:53 unbound[3925918:1] debug: mesh_run: validator module exit state is module_finished
Mar 31 11:33:53 unbound[3925918:1] debug: ipsecmod[module 0] operate: extstate:module_wait_module event:module_event_moddone
Mar 31 11:33:53 unbound[3925918:1] info: ipsecmod operate: query mirrors.fedoraproject.org. AAAA IN
Mar 31 11:33:53 unbound[3925918:1] debug: mesh_run: ipsecmod module exit state is module_finished
Mar 31 11:33:53 unbound[3925918:1] info: send_udp over interface: 1 127.0.0.1 127.0.0.1
Mar 31 11:33:53 unbound[3925918:1] debug: query took 7.962752 sec
Mar 31 11:33:53 unbound[3925918:1] reply: 127.0.0.1 mirrors.fedoraproject.org. AAAA IN SERVFAIL 7.962752 0 43
Mar 31 11:33:53 unbound[3925918:1] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 4 recursion replies sent, 0 replies dropped, 0 states jostled out
Mar 31 11:33:53 unbound[3925918:1] info: average recursion processing time 3.760770 sec
Mar 31 11:33:53 unbound[3925918:1] info: histogram of recursion processing times
Mar 31 11:33:53 unbound[3925918:1] info: [25%]=0.196608 median[50%]=0.262144 [75%]=6
Mar 31 11:33:53 unbound[3925918:1] info: lower(secs) upper(secs) recursions
Mar 31 11:33:53 unbound[3925918:1] info:    0.131072    0.262144 2
Mar 31 11:33:53 unbound[3925918:1] info:    4.000000    8.000000 2
Mar 31 11:33:53 unbound[3925918:1] debug: cache memory msg=144072 rrset=157024 infra=47892 val=80250 subnet=0
dig fedoraproject.org +dnssec

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> fedoraproject.org +dnssec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47968
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;fedoraproject.org.             IN      A

;; Query time: 1754 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Mar 31 11:48:15 EDT 2021
;; MSG SIZE  rcvd: 46
wcawijngaards commented 3 years ago

Looks like that DNSKEY lookup is not working. Try digging for the dig fedoraproject.org DNSKEY +dnssec and see how that works out; that is what seems to be failing for unbound. Looks like PMTU and no TCP fallback possible because TCP is blocked by firewall.

SomePersonSomeWhereInTheWorld commented 3 years ago

Yes looks like it's failing. I tested do-tcp: no, same problem. Did you see the debug logs? It's a ping-pong match, going back and forth between UDP and TCP.

 dig fedoraproject.org DNSKEY +dnssec

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> fedoraproject.org DNSKEY +dnssec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 52761
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;fedoraproject.org.             IN      DNSKEY

;; Query time: 777 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Apr 01 08:57:21 EDT 2021
;; MSG SIZE  rcvd: 46
wcawijngaards commented 3 years ago

No, I mean do that dig to the ip address of one of the nameservers for fedoraproject.org. Not to the unbound like you just did.

If so, that it fails to work, then it seems the PMTU size is a problem. What edns-buffer-size do you have configured? Consider lowering it to 1232, and that should make fedoraproject.org return truncated. But then unbound has to use TCP to fetch it and you have no TCP access. If you fix the TCP connectivity it then works, I guess.

SomePersonSomeWhereInTheWorld commented 3 years ago

I'm not sure what you are asking me to do. Below are dig's from another server not using unbound to both the NS and IP for fedoraproject.org. As for edns-buffer-size it still is commented out to the default 1232:

       # EDNS reassembly buffer to advertise to UDP peers (the actual buffer
        # is set with msg-buffer-size). 1472 can solve fragmentation (timeouts)
        # edns-buffer-size: 1232
dig ns02.fedoraproject.org

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> ns02.fedoraproject.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34521
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;ns02.fedoraproject.org.                IN      A

;; ANSWER SECTION:
ns02.fedoraproject.org. 20270   IN      A       152.19.134.139

;; Query time: 1 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Apr 01 09:23:29 EDT 2021
;; MSG SIZE  rcvd: 67
dig 152.19.134.139

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> 152.19.134.139
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1017
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;152.19.134.139.                        IN      A

;; Query time: 9 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Apr 01 09:23:37 EDT 2021
;; MSG SIZE  rcvd: 43
dig DNSKEY +dnssec 152.19.134.139

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> DNSKEY +dnssec 152.19.134.139
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 9663
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 65494
;; QUESTION SECTION:
;152.19.134.139.                        IN      DNSKEY

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Apr 01 09:28:46 EDT 2021
;; MSG SIZE  rcvd: 43
wcawijngaards commented 3 years ago

dig @152.19.134.139 fedoraproject.org DNSKEY +dnssec is what I wanted, that is similar to what unbound does itself.

From your logs it looks like that fails. It tries to signal via UDP that it is truncated. Your edns-buffer-size looks okay. And the reason that this lookup fails is then because TCP fails for you. The dig command should show that.

You then have the proof what query fails and why. I guess you need to fix TCP connectivity.

SomePersonSomeWhereInTheWorld commented 3 years ago

See below. What does fix this though is if I set val-permissive-mode to yeseverything including dnf update works. My colleagues believe the issue is related to DNSSEC.

dig @152.19.134.139 fedoraproject.org DNSKEY +dnssec

; <<>> DiG 9.11.28-RedHat-9.11.28-1.fc33 <<>> @152.19.134.139 fedoraproject.org DNSKEY +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62227
;; flags: qr aa rd; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
; COOKIE: 16dedd5e423fea9b050752066065cbe5fb6c3fe1a5421253 (good)
;; QUESTION SECTION:
;fedoraproject.org.             IN      DNSKEY

;; ANSWER SECTION:
fedoraproject.org.      300     IN      DNSKEY  257 3 5 AwEAAdTXJc0joiKGfTvLXi+LXxGpKvPvOoJEst9PR8TCCvXGVp7h3BY3 uXLkjckuT0aopCp2KF8zHgNgpMK03p1fd94pn9JZSuxfqvKsiYH2KvNO a/655oPj06jRhqAP5grX01Iz4BH411ZhGxIQ1BzZtOr1wAazojMJzLUg ChRJs8GVt3LU0e6T8z1RQF33Dt9UMHIR5EAsFAqfZ/tsbfJDYktGoZi3 nFlW7A745+ObM1LNXOWq3FcYPVzhH08Q7/7WpxmzM6/ET8VeqWIsvh8E nZNDNMfJyPbY9B1BOIrFCpE03ALgFMejaBZwmeQaX+D4Duup5xGOmdtC O4GSpM1YH6c=
fedoraproject.org.      300     IN      DNSKEY  256 3 14 04ZsDOgyzs3kJsJ4jEY3MYufkCOWm1OI8N4M+dlBOBmweln0TSaKfafH zNCkaPiVG4bdgdnrzwxmjpK5GQgsiB47np+I8850Ea3EJG5ORDl3f//l rr92HiYh5DxCNhkG
fedoraproject.org.      300     IN      DNSKEY  257 3 14 7ttmhus8JD56ybsvMVZVsXa3U2R+2+WmOPIP7BU6t2LicosMZ2Ju3pfv ijsa5LvBvVCB4xVtLSqEdLSvW4vJPLSAB2uyJwHPJMezh0SzGmVCImLU 6qDxsxjHqtZ76/Sf
fedoraproject.org.      300     IN      DNSKEY  256 3 5 AwEAAcCWNQWl5pCI3iOOP2r8nStL60Zjb/2JQLQytamVap0L44z0YWft u7pu0hx3cnIM1ejQOsEwbg2/10IyC+38cYqJDXbSdFg1zGztOS5xNz7r 9hzSRK5N2jkycdJ/BoByJ4Y+XGpDqfG4I97++8sIzSrw60TmGAKTvM9v iL3ByeCN
fedoraproject.org.      300     IN      RRSIG   DNSKEY 5 2 300 20210429222424 20210330222424 7725 fedoraproject.org. MRAXV4nt4DGJYlbGhO3T4XMlt4Ytik1bHXC/0+cmrQ7REh2G06CiECcF pScnuHR7QgxlA85yzIvCcTbulrujV/GfEJLH02ulcUkioUZdAjXJVTy8 pbZdSGYcUcsvfqzu32SlNirRK0YrZGhKSTAS68MHGDb25qFJRen1lJSK KoE=
fedoraproject.org.      300     IN      RRSIG   DNSKEY 5 2 300 20210429222424 20210330222424 16207 fedoraproject.org. wiFDZd3mJ5Wi/s9/R+6lmqnjcY6NHNfwQ+vIChBOQNnkigX1+RZHNX1p 2xoLUojbIpJ+rn118HX4vQD086ggCyEeqsha5liUEeWXQJckzlgfB2zH aAmkCJdvUGeRaKCMwKcGMv5N++b2tmZjnsRtE83egcBV+0A+4zRRqAKC uw1tt6bSHYcEvpoFCVjJz8J9LDG/i7tFqqbhhDLhCLUtOieOwdWZlEYn rg1FOB5uFmgotzrmCKNZlcQWlloCMi6vVY55XeXbnKqz47PcKK/CD5+v LPmPvAf+OLvBxfBZySyWqvO8YiBYoqSEvqP4vy+Oa1ewabNjvzRpZS9V QHtxZw==
fedoraproject.org.      300     IN      RRSIG   DNSKEY 14 2 300 20210429222424 20210330222424 58125 fedoraproject.org. JoMriz/7Cv3F0FeDOt2A1LrhJxy4G/lE8U6nZgRsjYLx52PsqKRNppKG l3h34/nJcBLdqgAoNiG9MEsJaSDGsySrwknLg+1B9DQcLX+TpstY78Dq P7Z3Uxv4SZh8nh68
fedoraproject.org.      300     IN      RRSIG   DNSKEY 14 2 300 20210429222424 20210330222424 60624 fedoraproject.org. zIxjPKOU9BECip3qBwfMMxCsbsEvHflsX4UfaM0JeQ0E2V4g8Km0RNuN PDT/ZsYPJrk1xDstFKDicSP2OhMYJjI4y//tTGwm23Ki8ZDNC77KDdmP a+f/5pQaznN7EJuC

;; Query time: 30 msec
;; SERVER: 152.19.134.139#53(152.19.134.139)
;; WHEN: Thu Apr 01 09:34:29 EDT 2021
;; MSG SIZE  rcvd: 1494
wcawijngaards commented 3 years ago

So in your logs, that was tcpdump, unbound keeps asking for this but cannot get an answer. But when you dig for it, it works fine. Can you get a verbosity 4 log of it so I can see what unbound is doing, and getting as a reply? Because that looks different and not like what dig got.

Oh I just noticed that dig gets edns-buffer-size: 4096 . Not sure what it set in the query, it may also have set that. In that case edns-buffer-size: 4096 may fix your lookups. Because unbound has 1232, which is a better value because of fragmentation, but causes TCP fallback. And TCP does not work for this server. So 4096 would make it likely work without TCP fallback.

It would still be a better recommendation to have edns-buffer-size: 1232 (the default, so no config line needed), and fix so that TCP connectivity works for this server. That is better for UDP fragmentation and then lookups work too, using the TCP lookups when needed. But likely the edns-buffer-size setting above could fix your immediate problem.

SomePersonSomeWhereInTheWorld commented 3 years ago

Oh I just noticed that dig gets edns-buffer-size: 4096 . Not sure what it set in the query, it may also have set that. In that case edns-buffer-size: 4096 may fix your lookups. Because unbound has 1232, which is a better value because of fragmentation, but causes TCP fallback. And TCP does not work for this server. So 4096 would make it likely work without TCP fallback.

Indeed setting edns-buffer-size: 4096 fixes the dnf updates with fedoraproject.org! So in this case TCP fallback isn't needed. I will continue to follow up with our IT group on allowing 53/TCP.

What issues might be experienced with having edns-buffer-size: 4096?

wcawijngaards commented 3 years ago

Fragmentation issues, these can be network issues but also reassembly can get exploited. It was the topic of DNS Flag Day, https://dnsflagday.net/2020/ .

nursoda commented 2 years ago

Seems that this issue may be closed. Please don't hijack issues: This was about a file permission error upon service start, not firewalling issues upon service usage.

wcawijngaards commented 2 years ago

Sure, it seems that the issues have been resolved.