NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

Resolving domain dnsops.gov makes unbound unresponsive #354

Closed BKPepe closed 4 years ago

BKPepe commented 4 years ago

Hi guys,

While searching on the Internet for a domain I could use to test DANE, I found one that is definitely broken: dane-test.had.dnsops.gov. Querying it leaves unbound running, but it no longer resolves anything and a restart of unbound is required. Looking at dnsviz.net, I see there are issues with resolving the NS names main.dnsops.gov and monitor.dnsops.gov.

This was confirmed using dig +trace dane-test.had.dnsops.gov, which fails with:

couldn't get address for 'main.dnsops.gov': failure
couldn't get address for 'monitor.dnsops.gov': failure
dig: couldn't get address for 'main.dnsops.gov': no more

If I do dig dane-test.had.dnsops.gov, unbound stops resolving anything, but it keeps running; it does not crash. Let's try asking the NS servers directly; the problem is the same:

When using unbound:

root@turris:~# dig main.dnsops.gov

; <<>> DiG 9.16.8 <<>> main.dnsops.gov
;; global options: +cmd
;; connection timed out; no servers could be reached

When using unbound:

root@turris:~# dig monitor.dnsops.gov

; <<>> DiG 9.16.8 <<>> monitor.dnsops.gov
;; global options: +cmd
;; connection timed out; no servers could be reached

It looks as if there were no Internet connection, but ping works; it is just unbound that is not resolving addresses. Trying Knot Resolver, or asking public recursive resolvers such as CZ.NIC ODVR, Cloudflare, and Google DNS, returns SERVFAIL as well.

Domain dane-test.had.dnsops.gov, queried via Cloudflare:

root@turris:~# dig @1.1.1.1 dane-test.had.dnsops.gov

; <<>> DiG 9.16.8 <<>> @1.1.1.1 dane-test.had.dnsops.gov
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 36342
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority)
; EDE: 6 (DNSSEC Bogus)
;; QUESTION SECTION:
;dane-test.had.dnsops.gov.      IN      A

;; Query time: 1752 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Wed Nov 25 01:00:36 CET 2020
;; MSG SIZE  rcvd: 65

NS monitor.dnsops.gov, queried via Google DNS:

root@turris:~# dig @8.8.8.8 monitor.dnsops.gov

; <<>> DiG 9.16.8 <<>> @8.8.8.8 monitor.dnsops.gov
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 63230
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;monitor.dnsops.gov.            IN      A

;; Query time: 4028 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Nov 25 00:56:54 CET 2020
;; MSG SIZE  rcvd: 47

This was tested on Unbound versions 1.11.0 and 1.12.0 using OpenWrt.

In strace I see the following; I am not sure whether it is useful:

brk(0x7952000)                          = 0x7952000
brk(0x7955000)                          = 0x7955000
brk(0x7958000)                          = 0x7958000
brk(0x795b000)                          = 0x795b000
open("/var/etc/unbound/root.keys", O_RDONLY|O_LARGEFILE) = 9
read(9, "; autotrust trust anchor file\n;;"..., 1024) = 758
read(9, "", 1024)                       = 0
close(9)                                = 0
getpid()                                = 23084
getrandom("\x79\x44\x50\xb6\x71\xca\xc3\xb1\xdd\xbd\x6c\x31\x03\x91\x36\x4d\x73\x60\xc0\xd8\xea\xbe\xe7\xb7\xb0\x8b\x6f\xd8\xa3\x95\xfd\xc2"..., 40, 0) = 40
mmap2(NULL, 8, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb79e9000
mmap2(NULL, 1088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb79e8000
mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7520000
getpid()                                = 23084
getpid()                                = 23084
getpid()                                = 23084
and from then on a long stream of repeated getpid calls.

Also, I tried verbose logging on Unbound, but after running it for just 2 minutes the log is already huge (8 MB, ~100,000 lines). If necessary, I can provide it.
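For completeness, this is roughly how the verbose log was captured; a sketch only, and the verbosity level 3 and the use of logread are just what applies on an OpenWrt box with remote-control enabled:

# raise the log level at runtime (needs remote-control: yes in unbound.conf)
unbound-control verbosity 3

# reproduce the problem
dig dane-test.had.dnsops.gov

# on OpenWrt unbound logs to syslog, so collect the messages from there
logread | grep unbound > /tmp/unbound-verbose.log

# drop the verbosity back to normal afterwards
unbound-control verbosity 1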

Thanks go to @VojtechMyslivec, who helped me with this issue and reproduced it on his end, too.

ghost commented 4 years ago

On my node with kernel 5.10-rc4 (not patched through downstream) and OpenWrt's unbound package (unbound-daemon_1.12.0-1_arm_cortex-a9_vfpv3-d16.ipk), the connectivity issue partly reproduces:

error: SERVFAIL <monitor.dnsops.gov. A IN>: could not fetch nameservers for 0x20 fallback
error: SERVFAIL <main.dnsops.gov. A IN>: exceeded the maximum number of sends
error: SERVFAIL <main.dnsops.gov. A IN>: could not fetch nameservers for 0x20 fallback
error: SERVFAIL <dane-test.had.dnsops.gov. A IN>: exceeded the maximum number of sends
error: SERVFAIL <dane-test.had.dnsops.gov. A IN>: exceeded the maximum number of sends

If I do dig dane-test.had.dnsops.gov, unbound stops resolving anything, but it keeps running; it does not crash.

This does not reproduce, i.e. unbound still handles any other queries thereafter.

vcunat commented 4 years ago

I think it's clear that the zone is broken: many of its servers don't respond at all, and the rest won't return any DNSKEY (though the DS promises it). For me, other concurrent queries (easy ones) weren't affected. Still, it's a bit odd that the SERVFAIL from Unbound arrived either never or only after a very long time:

;; From 127.0.0.1@53(UDP) in 96312.3 ms
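For a by-hand check of the breakage (a sketch, not commands taken from this thread; just plain dig against the local resolver plus a trace):

# the parent (.gov) publishes a DS for dnsops.gov, so a signed zone is expected
dig DS dnsops.gov +dnssec

# walking the delegation shows that the dnsops.gov servers either cannot be
# reached or do not hand out the promised DNSKEY
dig +trace DNSKEY dnsops.gov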
ghost commented 4 years ago

It takes a while, yes, probably until it reaches the sends limit, but not as long as you are reporting. I just restarted unbound to clear the cache and then:

(screenshot: Screenshot 2020-11-25 102435)

Not sure whether IPv4 / IPv6 matters.

wcawijngaards commented 4 years ago

Unbound probes servers that are not responding with fairly long timeouts. The timeouts are documented here https://www.nlnetlabs.nl/documentation/unbound/info-timeout/

So it is normal that it takes that long: there is no information yet, and unbound has the capacity to keep trying for a while. It then caches what it learned in case someone tries again.
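If it helps, the per-server timeout/backoff state that gets cached can be inspected and cleared with unbound-control; this is the generic mechanism, not anything specific to this zone:

# show the cached infra entries (RTT estimates, backoff) including dead servers
unbound-control dump_infra

# forget that state, e.g. once the zone is fixed, instead of restarting unbound
unbound-control flush_infra all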

There is logic in there to stop these kinds of long queries from swamping unbound's request list; if that gets full, the queries are dropped earlier.
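The knobs for that behaviour live in unbound.conf; the values below are only illustrative, not recommendations:

server:
    # how many queries each thread may have outstanding at once
    num-queries-per-thread: 1024
    # when the request list is full, queries running longer than this
    # (in milliseconds) can be dropped in favour of new ones
    jostle-timeout: 200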

wcawijngaards commented 4 years ago

I believe this resolves the issue, so I am marking it as closed.