NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

exceeded the maximum nameserver nxdomains #362

Open gzzhangxinjie opened 3 years ago

gzzhangxinjie commented 3 years ago

On my unbound server, I found log entries like

error: SERVFAIL <a.b.example.com A IN>: exceeded the maximum nameserver nxdomains

(I used a.b.example.com to replace the real domain.)

My questions are:

  1. What situation causes this error?
  2. We also found that not only does a.b.example.com fail to resolve, but xx.example.com cannot be resolved either. Why?

wcawijngaards commented 3 years ago

When looking up the addresses for nameservers, unbound encounters too many NXDOMAIN responses for those lookups and stops to avoid causing a denial of service. The domain has a long list of NS records, and those domains perhaps also have lists of NS records. All of those, or a lot of them, have no addresses and thus do not work. While trying to resolve the domain, unbound is recursively looking up those nameservers, and the nameservers needed to look up those nameservers, and this takes too many resources.

The domain should not have such a long list of nameservers that do not have addresses, or nameservers for the nameservers that have no addresses. There was a CVE about this a while ago: this resource consumption causing issues, too many queries and too much resource usage on the DNS servers. This error stops the resource usage.

mnalis commented 2 years ago

However, it seems that exceeded the maximum nameserver nxdomains massively appears for otherwise normal domains after a temporary network connectivity loss (it seems unbound does not distinguish here between an actual authoritative NXDOMAIN and temporary network unavailability), and it marks all domains that it tried to access while the network was down as NXDOMAIN.

This leads to the admin needing to restart the unbound server manually after every WLAN/network connectivity issue, which makes it basically unusable as a caching DNS server.

Debian Bullseye GNU/Linux, unbound 1.13.2-1 (using forward-tls-upstream: yes, forward-first: yes, forward-addr:)
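
For reference, the forwarding setup described above corresponds roughly to this unbound.conf sketch; the upstream address and TLS auth name are placeholders, not the actual forwarder:

```
forward-zone:
    name: "."
    forward-first: yes
    forward-tls-upstream: yes
    # placeholder upstream; the real forward-addr was not disclosed above
    forward-addr: 192.0.2.1@853#dns.example.net
```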

e.g.

```
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:00 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:03 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:03 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:03 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:03 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:03 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:06 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:06 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:06 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:06 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:06 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 unbound: [1536:0] error: SSL_handshake syscall: No route to host
Aug 25 03:31:09 wpa_supplicant[2138]: wlan0: SME: Trying to authenticate with xx:xx:xx:xx:xx:xx (SSID='whatever' freq=2437 MHz)
Aug 25 03:31:10 wpa_supplicant[2138]: wlan0: Trying to associate with xx:xx:xx:xx:xx:xx (SSID='whatever' freq=2437 MHz)
Aug 25 03:31:10 wpa_supplicant[2138]: wlan0: Associated with xx:xx:xx:xx:xx:xx
Aug 25 03:31:10 wpa_supplicant[2138]: wlan0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Aug 25 03:31:10 wpa_supplicant[2138]: wlan0: WPA: Key negotiation completed with xx:xx:xx:xx:xx:xx [PTK=CCMP GTK=CCMP]
Aug 25 03:31:10 wpa_supplicant[2138]: wlan0: CTRL-EVENT-CONNECTED - Connection to xx:xx:xx:xx:xx:xx completed [id=0 id_str=]
Aug 25 03:31:13 unbound: [1536:0] error: SSL_handshake syscall: Connection refused
Aug 25 03:31:13 unbound: [1536:0] error: SSL_handshake syscall: Connection refused
Aug 25 03:31:13 unbound: [1536:0] error: SSL_handshake syscall: Connection refused
Aug 25 03:31:16 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
[... many other addresses removed for privacy reasons...]
Aug 25 03:33:31 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Aug 25 03:33:39 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Aug 25 03:33:39 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Aug 25 03:34:30 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Aug 25 03:34:30 unbound: [1536:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
```

dMopp commented 1 year ago

I have observed the same issue.

Today I updated my OpenWrt, and afterwards I had to restart unbound inside my container...

Sep 08 07:20:26 unbound[837545:0] error: SERVFAIL <www.iana.org. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:26 unbound[837545:0] error: SERVFAIL <www.iana.org. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:28 unbound[837545:1] error: SERVFAIL <cocoapi.bmwgroup.com. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:41 unbound[837545:1] error: SERVFAIL <nv2-namain.netatmo.net. A IN>: exceeded the maximum number of sends
Sep 08 07:20:41 unbound[837545:1] error: SERVFAIL <nv2-namain.netatmo.net. A IN>: exceeded the maximum number of sends
Sep 08 07:20:45 unbound[837545:0] error: SERVFAIL <diag.meethue.com. AAAA IN>: exceeded the maximum nameserver nxdomains
Sep 08 07:21:02 unbound[837545:0] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:02 unbound[837545:0] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:12 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:12 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable
Sep 08 07:21:20 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: exceeded the maximum number of sends
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: exceeded the maximum number of sends
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:26 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:27 unbound[837545:1] error: SERVFAIL <rt.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:37 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:37 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:22:06 unbound[837545:1] error: SERVFAIL <api.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable

After the restart, everything was fine again...

cboltz commented 1 year ago

I see the same problem here, and have to regularly restart unbound after my RSS reader tries to fetch all of my feeds. (I admit that I read lots of RSS feeds. This includes some feeds on no-longer-existing domains, but I keep them in my archive, and my RSS reader unfortunately still tries to fetch them.)

Is there any solution or workaround (like a config option to raise the query and/or nxdomain limit) for this issue?

c3ph3us commented 1 year ago

Same here! Hoping for a fast resolution.

Is there a way to instruct unbound not to count nxdomains when there is no connection to the upstream resolver pointed to by forward-zone: / forward-addr:, and to log the connection-related issue instead?

I could periodically check the logs and, when I encounter such a situation, send a SIGHUP to the unbound process, but that is a hackish, ugly solution...

root@dl360g7:~# unbound -V
Version 1.16.2

Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --enable-tfo-server
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.5 5 Jul 2022
Linked modules: dns64 python subnetcache respip validator iterator
TCP Fastopen feature available

Oct 27 20:57:43 unbound[124347:1] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:1] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:0] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:2] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:46 unbound[124347:3] info: server stats for thread 3: 7072191 queries, 6857794 answers from cache, 214397 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:46 unbound[124347:3] info: server stats for thread 3: requestlist max 90 avg 1.21422 exceeded 0 jostled 0
Oct 27 20:57:46 unbound[124347:3] info: average recursion processing time 1.206027 sec
Oct 27 20:57:46 unbound[124347:3] info: [25%]=0.0793683 median[50%]=0.174644 [75%]=1.0103
[... histogram of recursion processing times for thread 3 omitted ...]
Oct 27 20:57:47 unbound[124347:0] info: server stats for thread 0: 6956869 queries, 6746181 answers from cache, 210688 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:0] info: server stats for thread 0: requestlist max 79 avg 1.18437 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:0] info: average recursion processing time 1.236167 sec
Oct 27 20:57:47 unbound[124347:0] info: [25%]=0.0791236 median[50%]=0.173051 [75%]=1.00855
[... histogram of recursion processing times for thread 0 omitted ...]
Oct 27 20:57:47 unbound[124347:2] info: server stats for thread 2: 6998925 queries, 6787092 answers from cache, 211833 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:2] info: server stats for thread 2: requestlist max 90 avg 1.15098 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:2] info: average recursion processing time 1.439691 sec
Oct 27 20:57:47 unbound[124347:2] info: [25%]=0.0793255 median[50%]=0.173985 [75%]=1.0245
[... histogram of recursion processing times for thread 2 omitted ...]
Oct 27 20:57:47 unbound[124347:1] info: server stats for thread 1: 7080760 queries, 6865904 answers from cache, 214856 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:1] info: server stats for thread 1: requestlist max 78 avg 1.17997 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:1] info: average recursion processing time 1.252491 sec
Oct 27 20:57:47 unbound[124347:1] info: [25%]=0.0798659 median[50%]=0.17527 [75%]=1.00988
[... histogram of recursion processing times for thread 1 omitted ...]
Oct 27 20:57:50 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:50 unbound[124347:1] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:51 unbound[124347:2] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:51 unbound[124347:1] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:2] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:2] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:2] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:3] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:1] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:54 unbound[124347:3] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:56 unbound[124347:1] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:56 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:56 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:57:56 unbound[124347:1] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:03 unbound[124347:3] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:03 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:04 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:04 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:04 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains
Oct 27 20:58:04 unbound[124347:0] error: SERVFAIL : exceeded the maximum nameserver nxdomains

wcawijngaards commented 1 year ago

Right now it counts the failure to look up the nameservers as a failure that counts towards the maximum failed nameserver lookups. Since the lookups are failing, it is reasonable that the query also fails. But the error message talks about nxdomains, and the error text is not about connection failures. It is visible in the code that the subquery returned SERVFAIL, but not that this was due to connectivity. And the SERVFAIL is not an nxdomain, but it still has to count towards failing the query itself, because the spawned lookup fails.

CompuRoot commented 1 year ago

When looking up the addresses for nameservers, unbound encounters too many NXDOMAIN responses for those lookups and stops to avoid causing a denial of service.

Such a workflow might work for a setup where unbound serves regular clients, but in the email world, mail servers query multiple external antispam services (based on the RBL DNS protocol) on every incoming connection over port 25. There it is normal for those services to return NXDOMAIN as the result of testing an IP address; that tells the mail server whether the IP in question is listed as an abuser/spammer or not. Those services (spamhaus.org, dnswl.org, barracudacentral.org, phishtank.rspamd.com, msbl.org, spamcop.net and many others) require queries to come directly from the mail server and will not work if unbound is in forwarding mode (to protect those services from abuse as well as DDoS). So unbound is not acting as an abuser, it is just following the RBL protocol, and when unbound switches into "exceeded the maximum nameserver nxdomains" mode it becomes useless.

Is there any option to turn such protection off?

wcawijngaards commented 1 year ago

That does not sound like a problem, because an NXDOMAIN for the target query is not the issue. It is NXDOMAIN for the nameserver of the domain, for the address of the DNS nameserver, that matters. An NXDOMAIN for the target query, the antispam service lookup, is not an issue; that can be any number of queries. So that should not be an issue for the test of an IP address lookup.

Protection against the overload issue is not something that should be optional.

If you want to exercise control over it, a stub-zone with particular nameserver names, or nameserver addresses, listed in the Unbound configuration will make unbound use those nameservers. And if that lists existing nameservers, it would change the nameserver lookups.
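
For illustration, a minimal stub-zone sketch along those lines; the zone name, host name and address below are placeholders, not real RBL infrastructure:

```
stub-zone:
    # hypothetical zone whose nameservers we want to pin
    name: "dnsbl.example."
    # either name the servers...
    stub-host: ns1.provider.example.
    # ...or list their addresses directly
    stub-addr: 192.0.2.53
```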

CompuRoot commented 1 year ago

So that should not be an issue for the test of an IP address lookup.

Looks like an issue:

Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <b.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains                                                                        
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <c.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains                                                                        
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <removed2avoidSubdomain-disclosure.dwl.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains                                                            
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <a.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains                                                                        
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <d.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                                         
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <e.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                                         
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.spamhaus.com. A IN>: exceeded the maximum nameserver nxdomains                                                                     
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.dbl.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                          
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <b.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                                         
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <d.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                                         
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <e.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                                         
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.dbl.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains                                                           
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <a.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains

If you want to exercise control over it, a stub-zone with particular nameserver names

Some antispam services roll over their RBL domains to hide them from spammers, so unfortunately it is not always known up front which domain(s) should be added to a stub-zone.

CompuRoot commented 1 year ago

BTW, are these servfails below related to the same case?

Mar 01 10:20:20 unbound[8110:3] info: 127.0.0.1 1.0.0.127.bl.spameatingmonkey.net. A IN SERVFAIL 0.068697 0 62                                                                                
Mar 01 10:20:25 unbound[8110:0] info: 127.0.0.1 1.0.0.127.rep.mailspike.net. A IN SERVFAIL 0.073550 0 56                                                                                      
Mar 01 10:20:57 unbound[8110:1] info: 127.0.0.1 removed2avoidSubdomain-disclosure.multi.surbl.org. A IN SERVFAIL 0.072373 0 62                                                                                
Mar 01 10:21:11 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.uribl.spameatingmonkey.net. A IN SERVFAIL 0.070020 0 64                                                                              
Mar 01 10:21:59 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.ebl.msbl.org. A IN SERVFAIL 0.263217 0 60                                                                                  
Mar 01 10:22:30 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.email.rspamd.com. A IN SERVFAIL 0.071349 0 59

and SERVFAILs like this (XXX.XXX.XXX.XXX is a replacement for the real IP):

Mar 08 02:42:25 unbound[4185:3] error: SERVFAIL <XXX.XXX.XXX.XXX.bl.score.senderscore.com. A IN>: all servers for this domain failed, at zone bl.score.senderscore.com.                       
Mar 08 02:42:25 unbound[4185:3] info: 127.0.0.1 XXX.XXX.XXX.XXX.bl.score.senderscore.com. A IN SERVFAIL 0.365197 0 69

The only cure is to restart unbound, but on busy servers that isn't a solution at all.

wcawijngaards commented 1 year ago

No, the IP address lookup in the antispam zone is not the issue, even when that answer is NXDOMAIN. The issues from the logs are that the nameservers are NXDOMAIN. So it is about the number of nameserver addresses that are NXDOMAIN.

It is awkward that a stub zone then does not work if the contents change. That removes the workaround option.

It looks like the actual zone configuration has a lot of nameserver addresses that are NXDOMAIN, and this triggers the protection measure that does not allow it. The large number of NXDOMAINs for nameserver addresses is rejected by the protection measure. This is certainly an unusual configuration.

Now we are talking about the protection measure and whether the configuration of the nameservers for the zone is reasonable. The resolver has to choose: protect against too many NXDOMAIN lookups, or resolve this zone. Security is most important, so that means the zone does not resolve. The zone was probably set up before this issue was reported, but now that a large number of NXDOMAINs is rejected, the setup is not something the resolver can accept.

It would be nice to be able, somehow, to figure out that this is a legitimate zone, humorously just like the antispam address lookup is trying to figure out. But I do not see an easy way to do that.

CompuRoot commented 1 year ago

The issues from the logs are that the nameservers are NXDOMAIN.

But those aren't NXDOMAIN after unbound has been restarted. Those names (like any other frequently queried names) become NXDOMAIN after some period of unbound use (it might be an hour or less on busy days, while resolving might work fine for a day or two without a restart on less busy servers). As soon as unbound is restarted, it works as expected again... for a short period of time, and it resolves the same domains without considering them NXDOMAIN.

This is certainly an unusual configuration... the setup is not something the resolver can accept.

unbound on the email servers works in non-forwarding mode and uses root-hints: "/etc/unbound/named.cache", where named.cache is updated once per month (if there are changes) from https://www.internic.net/domain/named.cache. There are no "privacy/countries" blocklists loaded, just a few records (3 to 5) added to local-zone: as always_nxdomain. The server itself where unbound is running is static:

  local-zone:     "mx.domain.com."    static
  local-data:     "mx.domain.com.     900  IN  A  1.2.3.4"
  local-data-ptr: "1.2.3.4            900  mx.domain.com."

0x20 is disabled, and DNSSEC is disabled as well with module-config: "iterator" (that helps a little but didn't resolve the resolving issue).
There is nothing special in the configuration that might trigger SERVFAIL. Older versions (1.4.x) of unbound in the same subnet are working fine, but 1.9.0 (on Debian 10) is constantly failing. Also, another group of servers that uses PowerDNS isn't affected by such issues.
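
For reference, the relevant parts of the setup described above would look roughly like this in unbound.conf (a sketch, not the full configuration; use-caps-for-id is the option behind the 0x20 randomisation mentioned):

```
server:
    # iterate from the root instead of forwarding
    root-hints: "/etc/unbound/named.cache"
    # 0x20 query-name randomisation disabled
    use-caps-for-id: no
    # validator module removed, i.e. DNSSEC validation disabled
    module-config: "iterator"
```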

Security is most important, so that means the zone does not resolve.

The funny thing is that some of the email servers are used for security notifications but fail to send emails because they cannot resolve the MX of the receiving servers; so it turned into "security for DNS" that fails to report actual security incidents. BTW, the receiver's MX records are on registrar-servers.com where the TTL is set to 5 minutes; that works quite stably and resolves without any issues when querying the registrar-servers.com NS servers directly, while unbound chokes when it gets too many queries for the same zone.

It is kind of the same situation with unbound in pfSense when used in a heavily loaded environment; the only cure there is to switch unbound into forwarding mode, and then "the issue" disappears. But that is not possible with email servers, since the RBLs won't accept queries from the ISP's DNS or from 8.8.8.8 and "friends".

wcawijngaards commented 1 year ago

From the list of responses, I see that the state of the answers changes, so the names change from working to not working. I think it is strange that the older unbound version works but the newer one does not. Perhaps qname minimisation, an option that is on by default since about 1.8, is causing the servers to give wrong answers. Specifically, a higher-level NXDOMAIN answer is then turned into an NXDOMAIN for other names. That would mean the servers are not compliant with the standards spec that says they should not return NXDOMAIN for those nodes.

Qname minimisation can be turned off with qname-minimisation: no. If the qname-minimisation-strict option was turned on, perhaps turn that off too. If it is this problem, that would explain how nameservers, after working for some period, end up with an nxdomain for their address, and then nxdomain is also returned for the addresses used by other queries, once the cache has received certain higher-level nxdomains.
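
A minimal sketch of the two settings mentioned; whether they help depends on how the upstream servers behave:

```
server:
    # send full query names upstream instead of minimised ones
    qname-minimisation: no
    # do not insist on minimisation when servers mishandle it
    qname-minimisation-strict: no
```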

CompuRoot commented 1 year ago

Thank you for the suggestion. Yes, qname-minimisation is turned on on those servers. I will turn it off and will report back whether it resolves the issue.

CompuRoot commented 1 year ago

Unfortunately nothing helps. unbound cannot be used on email servers where the antispam solution relies on well-known online services that use DNS to answer whether an IP is registered in a spam database or not, and answer with nxdomain for clean IPs. To rule out network disruption during the tests, we queried in parallel with another resolver, which confirmed that the remote DNS is working properly.

wcawijngaards commented 1 year ago

So, perhaps if harden-below-nxdomain is turned off in the config, the nameservers do not become nxdomains later on. If that works as a solution, it would indicate that the servers serving the nameserver records are sending nxdomain for intermediate labels, a protocol failure, perhaps because a custom script is involved. Or the entire domain does not exist, I mean the normal internet has intermediate nxdomains above these names and the names are specifically entered by configuration; in that case there could be more recent fixes, but turning off harden-below-nxdomain could also alleviate that problem.
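
The suggested setting, as a one-line sketch:

```
server:
    # do not synthesise NXDOMAIN for names below a cached NXDOMAIN
    harden-below-nxdomain: no
```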

LamberMa commented 4 months ago

Same issue here, even after turning off harden-below-nxdomain and setting qname-minimisation to no. The problem still exists. The unbound service needs to be restarted to restore functionality.

I'm using unbound 1.19

Error logs

Apr 09 18:09:14 unbound[26254:3] reply: 127.0.0.1 ipac.ctnsnet.com. A IN SERVFAIL 0.002092 0 45
Apr 09 18:09:14 unbound[26254:15] error: SERVFAIL <grow.nowcoder.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:15] reply: 127.0.0.1 grow.nowcoder.com. A IN SERVFAIL 0.001150 0 46
Apr 09 18:09:14 unbound[26254:18] error: SERVFAIL <hb-api.omnitagjs.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:18] reply: 127.0.0.1 hb-api.omnitagjs.com. A IN SERVFAIL 0.001932 0 49
Apr 09 18:09:14 unbound[26254:6] error: SERVFAIL <video.goofish.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:6] reply: 127.0.0.1 video.goofish.com. A IN SERVFAIL 0.002066 0 46
Apr 09 18:09:14 unbound[26254:13] error: SERVFAIL <s.company-target.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:13] reply: 127.0.0.1 s.company-target.com. A IN SERVFAIL 0.001939 0 49
Apr 09 18:09:14 unbound[26254:1a] error: SERVFAIL <kgnop3.kugou.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:1a] reply: 127.0.0.1 kgnop3.kugou.com. A IN SERVFAIL 0.002232 0 45
Apr 09 18:09:14 unbound[26254:3] error: SERVFAIL <mon11-misc-lf.fqnovel.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:3] reply: 127.0.0.1 mon11-misc-lf.fqnovel.com. A IN SERVFAIL 0.001992 0 54

Are there any other solutions?

LamberMa commented 4 months ago

When I set the verbosity to 3, I found the debug logs below:

Apr 09 18:57:37 unbound[9338:c] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:c] error: SERVFAIL <pull-flv-l11-cny.douyincdn.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:8] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:8] error: SERVFAIL <res.wx.qq.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:d] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:d] error: SERVFAIL <bd-adaptive-pull.video-voip.com.a.bcelive.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:c] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:c] error: SERVFAIL <pull-hs-spe-f5.douyinliving.com. A IN>: exceeded the maximum number of sends
wcawijngaards commented 4 months ago

Output with verbosity 4 may be even more useful, and in particular show where those 33 packets are sent and why they get no answer; this is perhaps visible in the logs. If you have IPv6 enabled but have no IPv6 connectivity, perhaps set do-ip6: no, that could save up to half of those attempts. The value itself is configurable; max-sent-count: 32 is the default.
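
A sketch of the two knobs mentioned, assuming a host without working IPv6 connectivity:

```
server:
    # skip IPv6 upstream targets if the host has no IPv6 route
    do-ip6: no
    # per-query limit on outgoing packets; 32 is the default
    max-sent-count: 32
```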

LamberMa commented 4 months ago

@wcawijngaards when I set the verbosity to 4, I get the verbose log below:

Apr 09 20:28:35 unbound[16386:5] info: average recursion processing time 12.093231 sec
Apr 09 20:28:35 unbound[16386:5] info: histogram of recursion processing times
Apr 09 20:28:35 unbound[16386:5] info: [25%]=2.03448 median[50%]=6.94737 [75%]=21.4737
Apr 09 20:28:35 unbound[16386:5] info: lower(secs) upper(secs) recursions
Apr 09 20:28:35 unbound[16386:5] info:    0.000000    0.000001 1
Apr 09 20:28:35 unbound[16386:5] info:    0.008192    0.016384 1
Apr 09 20:28:35 unbound[16386:5] info:    0.016384    0.032768 1
Apr 09 20:28:35 unbound[16386:5] info:    0.032768    0.065536 1
Apr 09 20:28:35 unbound[16386:5] info:    0.065536    0.131072 3
Apr 09 20:28:35 unbound[16386:5] info:    0.131072    0.262144 6
Apr 09 20:28:35 unbound[16386:5] info:    0.262144    0.524288 5
Apr 09 20:28:35 unbound[16386:5] info:    0.524288    1.000000 7
Apr 09 20:28:35 unbound[16386:5] info:    1.000000    2.000000 17
Apr 09 20:28:35 unbound[16386:5] info:    2.000000    4.000000 29
Apr 09 20:28:35 unbound[16386:5] info:    4.000000    8.000000 19
Apr 09 20:28:35 unbound[16386:5] info:    8.000000   16.000000 18
/exceed
Apr 09 20:28:37 unbound[16386:9] debug: svcd callbacks end
Apr 09 20:28:37 unbound[16386:9] debug: serviced_delete
Apr 09 20:28:37 unbound[16386:9] debug: serviced send timer
Apr 09 20:28:37 unbound[16386:9] debug: EDNS lookup known=0 vs=0
Apr 09 20:28:37 unbound[16386:9] debug: serviced query UDP timeout=376 msec
Apr 09 20:28:37 unbound[16386:9] debug: inserted new pending reply id=c516
Apr 09 20:28:37 unbound[16386:9] debug: opened UDP if=0 port=8726
Apr 09 20:28:37 unbound[16386:9] error: udp connect failed: Network is unreachable for 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:37 unbound[16386:9] debug: svcd callbacks start
Apr 09 20:28:37 unbound[16386:9] debug: worker svcd callback for qstate 0x1a4b240
Apr 09 20:28:37 unbound[16386:9] debug: mesh_run: start
Apr 09 20:28:37 unbound[16386:9] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:37 unbound[16386:9] info: iterator operate: query gs-loc.apple.com. A IN
Apr 09 20:28:37 unbound[16386:9] info: iterator operate: chased to bluedot.is.autonavi.com.gds.alibabadns.com. A IN
Apr 09 20:28:37 unbound[16386:9] debug: process_response: new external response event
Apr 09 20:28:37 unbound[16386:9] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:37 unbound[16386:9] debug: query response was timeout
Apr 09 20:28:37 unbound[16386:9] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:37 unbound[16386:9] info: processQueryTargets: gs-loc.apple.com. A IN
Apr 09 20:28:37 unbound[16386:9] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 65
Apr 09 20:28:37 unbound[16386:9] debug: request has exceeded the maximum number of sends with 65
Apr 09 20:28:37 unbound[16386:9] debug: return error response SERVFAIL
Apr 09 20:28:37 unbound[16386:9] debug: mesh_run: iterator module exit state is module_finished
Apr 09 20:28:37 unbound[16386:9] error: SERVFAIL <gs-loc.apple.com. A IN>: exceeded the maximum number of sends
Apr 09 20:28:37 unbound[16386:9] debug: query took 2.577494 sec
Apr 09 20:28:37 unbound[16386:9] reply: 127.0.0.1 gs-loc.apple.com. A IN SERVFAIL 2.577494 0 45
Apr 09 20:28:37 unbound[16386:9] debug: query took 7.657016 sec
Apr 09 20:28:37 unbound[16386:9] reply: 127.0.0.1 gs-loc.apple.com. A IN SERVFAIL 7.657016 0 45
Apr 09 20:28:37 unbound[16386:9] debug: query took 12.572124 sec
Apr 09 20:28:37 unbound[16386:a] info: 1vRDCD mod0  ns4.zdns.google. A IN
Apr 09 20:28:37 unbound[16386:a] info: 2vRDCD mod0  ns1.zdns.google. AAAA IN
Apr 09 20:28:37 unbound[16386:0] debug: svcd callbacks start
Apr 09 20:28:37 unbound[16386:0] debug: worker svcd callback for qstate 0x2efa8e0
Apr 09 20:28:37 unbound[16386:0] debug: mesh_run: start
Apr 09 20:28:37 unbound[16386:0] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:37 unbound[16386:0] info: iterator operate: query ns3.yahoo.com. AAAA IN
Apr 09 20:28:37 unbound[16386:0] debug: process_response: new external response event
Apr 09 20:28:37 unbound[16386:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:37 unbound[16386:5] info:   k.gtld-servers.net. * A AAAA
Apr 09 20:28:37 unbound[16386:5] info:   l.gtld-servers.net. * A AAAA
Apr 09 20:28:37 unbound[16386:5] info:   h.gtld-servers.net. * A AAAA
/exceed
Apr 09 20:28:39 unbound[16386:d] info: 0vRDCDd mod0  ns-383.awsdns-47.com. A IN
Apr 09 20:28:39 unbound[16386:1] info: [25%]=6.33333 median[50%]=97.1294 [75%]=156.055
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:2] error: udp connect failed: Network is unreachable for 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:3] info: sending query: tm1.edgedns-tm.info. A IN
Apr 09 20:28:39 unbound[16386:3] debug: sending to target: <info.> 2001:500:1a::1#53
Apr 09 20:28:39 unbound[16386:3] debug: dnssec status: not expected
Apr 09 20:28:39 unbound[16386:3] debug: mesh_run: iterator module exit state is module_wait_reply
Apr 09 20:28:39 unbound[16386:3] info: average recursion processing time 56.002378 sec
Apr 09 20:28:39 unbound[16386:3] info: histogram of recursion processing times
Apr 09 20:28:39 unbound[16386:e] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:e] debug: servselect ip6 2001:503:39c1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:e] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:e] debug: servselect ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:e] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.35.51.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:503:a83e::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.5.6.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:503:eea3::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.42.93.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.54.112.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.41.162.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:503:d2d::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.52.178.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:503:83eb::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.26.92.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug:    ip6 2001:500:856e::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug:    ip4 192.31.80.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: rpz: iterator module callback: have_rpz=0
Apr 09 20:28:39 unbound[16386:6] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:6] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:2] debug: svcd callbacks start
Apr 09 20:28:39 unbound[16386:2] debug: worker svcd callback for qstate 0x10246a0
Apr 09 20:28:39 unbound[16386:2] debug: mesh_run: start
Apr 09 20:28:39 unbound[16386:2] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:39 unbound[16386:2] info: iterator operate: query ssl.gstatic.com. A IN
Apr 09 20:28:39 unbound[16386:3] info: [25%]=8.66667 median[50%]=55.6364 [75%]=92.7619
Apr 09 20:28:39 unbound[16386:5] error: udp connect failed: Network is unreachable for 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:5] debug: svcd callbacks start
Apr 09 20:28:39 unbound[16386:5] debug: worker svcd callback for qstate 0x14658f0
Apr 09 20:28:39 unbound[16386:5] debug: mesh_run: start
Apr 09 20:28:39 unbound[16386:5] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:39 unbound[16386:5] info: iterator operate: query api.bilibili.com. A IN
Apr 09 20:28:39 unbound[16386:5] info: iterator operate: chased to a.w.bilicdn1.com. A IN
Apr 09 20:28:39 unbound[16386:5] debug: process_response: new external response event
Apr 09 20:28:39 unbound[16386:5] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:39 unbound[16386:5] debug: query response was timeout
Apr 09 20:28:39 unbound[16386:5] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:39 unbound[16386:5] info: processQueryTargets: api.bilibili.com. A IN
Apr 09 20:28:39 unbound[16386:5] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 65
Apr 09 20:28:39 unbound[16386:5] debug: request has exceeded the maximum number of sends with 65
Apr 09 20:28:39 unbound[16386:5] debug: return error response SERVFAIL
Apr 09 20:28:39 unbound[16386:5] debug: mesh_run: iterator module exit state is module_finished
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug:    rtt=1034
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug:    rtt=946
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.48.79.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug:    rtt=1539
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug:    rtt=1409
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.12.94.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug:    rtt=900
Apr 09 20:28:39 unbound[16386:9] debug: selrtt 376
Apr 09 20:28:39 unbound[16386:9] info: sending query: drive.wpsdns.com. A IN
Apr 09 20:28:39 unbound[16386:9] debug: sending to target: <com.> 2001:503:39c1::30#53
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:502:7094::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.48.79.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:39c1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.35.51.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:a83e::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.5.6.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:eea3::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.42.93.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.54.112.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.41.162.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:d2d::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.52.178.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:503:83eb::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.26.92.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug:    ip6 2001:500:856e::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    ip4 192.31.80.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: rpz: iterator module callback: have_rpz=0
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:502:1ca1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:503:231d::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:502:7094::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:3] info:    0.000000    0.000001 2
Apr 09 20:28:39 unbound[16386:3] info:    0.001024    0.002048 1
Apr 09 20:28:39 unbound[16386:3] info:    0.002048    0.004096 1
Apr 09 20:28:39 unbound[16386:a] info:   h.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   g.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   a.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   f.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   i.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   m.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   j.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   b.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info:   e.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:1] info: lower(secs) upper(secs) recursions
Apr 09 20:28:39 unbound[16386:a] debug:    ip6 2001:502:1ca1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:0] info: sending query: ns3.yahoo.com. AAAA IN
Apr 09 20:28:39 unbound[16386:0] debug: sending to target: <com.> 2001:503:39c1::30#53
Apr 09 20:28:39 unbound[16386:0] debug: dnssec status: not expected
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:6] debug:    rtt=376
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:6] debug:    rtt=1363
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:2] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:39 unbound[16386:2] debug: query response was timeout
Apr 09 20:28:39 unbound[16386:2] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:39 unbound[16386:2] info: processQueryTargets: ssl.gstatic.com. A IN
Apr 09 20:28:39 unbound[16386:2] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 8
Apr 09 20:28:39 unbound[16386:2] info: DelegationPoint<com.>: 13 names (0 missing), 26 addrs (26 result, 0 avail) cacheNS
Apr 09 20:28:39 unbound[16386:2] info:   d.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:2] info:   c.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:c] info: 14vRDCDd mod0  h5-f5gtm01-lsnr02.eset.com. AAAA IN
Apr 09 20:28:39 unbound[16386:c] info: 15RDdc mod0 rep dmd.metaservices.microsoft.com. A IN
Apr 09 20:28:39 unbound[16386:d] info: 1vRDCDd mod0  ns-383.awsdns-47.com. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 2vRDCDd mod0  n6dsce9.akamaiedge.net. AAAA IN
Apr 09 20:28:39 unbound[16386:a] debug:    ip4 192.12.94.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:d] info: 3vRDCDd mod0  n1dscgi3.akamai.net. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 4vRDCDd mod0  n6dscgi3.akamai.net. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 5RDd mod0 rep w2.kskwai.com. A IN
Apr 09 20:28:39 unbound[16386:0] debug: mesh_run: iterator module exit state is module_wait_reply
Apr 09 20:28:39 unbound[16386:0] info: mesh_run: end 12 recursion states (6 with reply, 0 detached), 13 waiting replies, 208 recursion replies sent, 0 replies dropped, 0 states jostled out

I adjusted max-sent-count to 64, but the problem still exists. From the log it can be seen that "208 recursion replies sent". I don't know if this is related to max-sent-count.

I then set max-sent-count to 256 and ran another test. The failure rate dropped significantly, with only a few cases where a domain name could not be resolved, and those few failures are acceptable. So increasing max-sent-count does alleviate the problem described above. I'm not sure whether this is best practice, but from a user-experience point of view it clearly works better.
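For reference, the only change was this one server option, roughly (everything else stayed at the defaults):

```
server:
    # default is 32; raising it to 256 made the SERVFAIL cases rare in my tests
    max-sent-count: 256
```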

CompuRoot commented 4 months ago

Are there any other solutions?

Yes, other resolvers.
Unfortunately unbound is unusable on servers and border gateways because of the hard-coded limit on the number of sends. Because of this, we removed it from all of our servers.

wcawijngaards commented 4 months ago

The interesting failure in the logs is udp connect failed: Network is unreachable for 2001:503:d414::30. It seems that udp-connect: no could be a solution. Or, if there are IPv6 routing problems, perhaps do-ip6: no is a solution.

So, what makes it fail to work on servers and border gateways? The maximum number of sends is configurable nowadays, and you can increase it if you like with max-sent-count:. But why would it fail so hard? 30 packets is a lot; most DNS queries should complete in only a few. The counter counts the resolver's attempts to send a packet, and somehow that is failing a lot.
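For anyone who wants to try those, the relevant server options would look roughly like this:

```
server:
    # do not use connected UDP sockets, so ICMP errors such as
    # "Network is unreachable" no longer fail the send attempt outright
    udp-connect: no

    # or, if there is no working IPv6 path at all, skip IPv6 targets
    do-ip6: no

    # the per-query send limit itself is configurable too
    max-sent-count: 64
```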

CompuRoot commented 4 months ago

But why would it fail so hard? 30 packets is a lot; most DNS queries should complete in only a few.

If unbound were only supposed to work on a single office computer that serves a regular user, where all it does is translate name => IP, then yes, you would be right. But for servers or specific gateways, overriding the remote NS replies breaks protocols and workflows that expect a huge amount of NXDOMAIN.

I already tried to explain it: as it stands, instead of continuing to deliver the exact NS server reply, such as NXDOMAIN, to the DNS consumer, unbound tries to play the good client that "protects the world from the bad guys" and avoids abusing the remote NS, and after a certain number of queries it converts NXDOMAIN into SERVFAIL, which is not true. In fact, on email servers this is basically the same as a MITM: it prevents email servers from classifying connecting domains as spammers or not, and in the end unbound helps spammers sneak their dirty messages through. unbound, as a client, is trying to do the job of the servers/infrastructure of those who expose the NS service and who have to protect themselves from DDoS on their own.

Overriding NXDOMAIN with SERVFAIL breaks antispam services that have worked over DNS for decades, where NXDOMAIN is just a flag that the domain in question is NX (does not exist in their spam list), or it resolves to some 127.x.x.x reply classifying it as a spammer; unbound simply prevents such services from doing their job.

max-sent-count is obviously a pretty good choice for public services (hospitals, libraries, campuses...) where it protects against malicious users/apps abusing NS servers, but it can't help in the cases described above, on email servers (of science/programming facilities) which have legitimate reasons to query remote NS servers and expect to get the exact replies from the remote NS instead of a decision made by an intermediate caching forwarder.

I'm sorry upfront for the long and expressive post, but its only intent is to help understand the issue.

wcawijngaards commented 4 months ago

No, it is not that 30 NXDOMAINs are an issue, or that unbound refuses NXDOMAIN responses. What is also awkward is that the original poster, the person who had a problem just before you commented, and you are all talking about different things. The original poster talks about the error of exceeding the nameserver nxdomains, the previous poster has trouble sending packets and hits the limit on the number of sent packets, and it seems you would like antispam domains to resolve correctly.

Unbound of course wants to store the NXDOMAIN responses and return them to the querier. It also wants to reply with the 127.x responses that antispam domains use. For that it performs lookups to get that data.

The maximum nameserver nxdomains exceeded error is an error from the recursion, where a nameserver lookup, not for the target domain name but for a nameserver, or a nameserver for a nameserver (recursively), gets an NXDOMAIN answer. This is unexpected, since nameservers, by their purpose, have addresses. It is also not the target query that the client makes. There is a limit, for overload prevention, on the number of failed nameserver lookups, and this is what the nameserver nxdomain exceeded error is about. It is not about the target query, whether that is NXDOMAIN or 127.x, just about its nameservers.

So unbound does copy the nameserver's reply to the client for that target query. And for the nameserver lookups themselves it would also store that the lookup was nxdomain, for example. But it stops after too much work is expended.

Both the maximum number of nameserver nxdomains and the max sent count are limits on the amount of work expended. They are set to be reasonable, e.g. normal domains should resolve. Antispam domains should also work just fine.

If the limits are exceeded, unbound returns SERVFAIL: the resolver has failed to look up the name. It is not trying to change the answer, but reporting that it could not do the lookup.

So maybe we can figure out, since the complaint is that something is wrong, what exactly is going wrong? It seems to me these must be different things, because the errors are different, and perhaps these are separate issues being discussed. So increase verbosity and perhaps it can reveal what happens those countless times. The inability to send packets should stop any server from functioning, and so should getting no address from nameserver lookups. Possibly there are also bugs in the code that prevent proper functioning or create these failures, but I have not been able to find any (in this ticket report).

CompuRoot commented 4 months ago

The maximum nameserver nxdomains exceeded error is an error from the recursion,

No. It doesn't work like a regular DNS query. To prevent abuse and to commercialize the service, most antispam databases that use the DNS mechanism require you to query them directly, from your own email server's specific static IP. If you go through an intermediate recursor, like 8.8.8.8 or the ISP's DNS, those queries will be rejected by the antispam database providers.

I don't know how to explain it more clearly: an NXDOMAIN reply is NOT an error, but a flag indicating that the queried domain DOES NOT EXIST in the spam list. It isn't related to common DNS name<=>IP resolution.

But it stops after too much work is expended. Both the maximum number of nameserver nxdomains and the max sent count are limits on the amount of work expended

And that's why unbound alone cannot be used on email servers. unbound decides when it will stop working, even though the workflow described above EXPECTS to get a huge amount of NXDOMAIN from the same, specific DNS servers, and even though the same query will be sent to them many times. The remote servers in such cases are not being abused, that is just how it works: the remote DNS server replies with NXDOMAIN when the queried domain isn't listed as a spammer, but that doesn't mean the domain in question won't be listed in the next hour if it starts sending spam, and that's why it is OK to query the same domain multiple times, often, even though it was NXDOMAIN for the last hundreds or thousands of queries.

If the limits are exceeded, unbound returns SERVFAIL

UNBOUNDFAIL would be a more appropriate response, not SERVFAIL.

SERVFAIL should mean that unbound can't talk to a server anymore because the SERVER has really FAILED, not because a DNS client decides it is tired, interprets NXDOMAIN in its own way, and refuses to do its job because some unbound-only hard-coded limit has been reached.

So maybe we can figure out, since the complaint is that something is wrong, what exactly is going wrong?

What is wrong: a DNS client must simply do its job without assuming that multiple NXDOMAIN answers are equal to SERVFAIL. unbound shouldn't act as a judge, assuming the remote server is being abused and returning the false DNS reply SERVFAIL to the querier even when the server is still alive and replying with proper DNS answers, even if those answers are NXDOMAIN a million times for the same query over a short period of time.

So increase verbosity and perhaps it can reveal what happens those countless times.

There are no errors between the server and unbound, so verbosity won't help. The error is in unbound, which invented a new way to substitute SERVFAIL for NXDOMAIN after some hard-coded measuring unique to unbound. This is a logic error that pops up after some number of queries, when unbound stops processing queries from its clients and returns the false reply SERVFAIL while other DNS clients, in comparison, successfully continue to work.

When querying antispam databases over DNS, the querier expects to get a huge amount of NXDOMAIN from a specific DNS server that uses the DNS protocol. DNS isn't just for resolving name/IP; it also serves other key/value pairs related to domains.

I hope someone will look at how MTAs and antispam databases have worked for decades, using DNS, when making decisions on email reception. Below are just a few such providers which you can check:

zen.spamhaus.org
b.barracudacentral.org
bl.spamcop.net
bl.spameatingmonkey.net
dnsbl.sorbs.net
psbl.surriel.com
bl.mailspike.net
list.dnswl.org

and there are dozens of other similar providers.

The same technique is also used in corporate environments at the DNS level, where a reply from specific upstream DNS providers gives NXDOMAIN a different meaning: it can indicate whether a name is or is not listed in some specific database, for some specific reason, for a specific short time, while still resolving to a specific reply when needed.
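To make the mechanism concrete, checking a client IP against one of these lists is just an ordinary A query with the octets reversed (the documentation address 192.0.2.1 is used here as an example):

```
# check 192.0.2.1 against a DNSBL: reverse the octets, append the zone
dig 1.2.0.192.zen.spamhaus.org A

# status NXDOMAIN   -> the IP is not listed
# answer 127.0.0.x  -> the IP is listed; the 127.x value encodes the reason
```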

wcawijngaards commented 4 months ago

Yes, of course Unbound should be able to resolve antispam domains, hopefully without bugs.

There seems to be a misunderstanding. It is about the nxdomain from the error versus other nxdomains. The nxdomain for an antispam lookup is not the topic of the error; there can be any number of those nxdomains. It is another type of lookup, the lookup for the address of a DNS server, made by unbound itself, that receives an unexpected nxdomain. Perhaps I can illustrate it with a short example: suppose there is a query for www.example.com, and that is nxdomain. That is fine and does not cause the error printout, and there can be any number of those nxdomain replies. To resolve example.com names Unbound may need to, for example, look up DNS servers, and it decides to look up the IP addresses of ns1.example.com, ns2.example.com, ns3.example.com ... These then return nxdomains, and this is what the error is about: if that happens too many times, the algorithm gives up, in an attempt to stop certain overload situations where there could be infinitely many such lookups.
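To put that example in query terms, with the same made-up names:

```
# the end-client query; an NXDOMAIN answer here is fine and is simply returned
dig www.example.com A

# lookups unbound makes internally for the delegation's nameserver addresses;
# it is these lookups that the limit counts
dig ns1.example.com A    # NXDOMAIN
dig ns2.example.com A    # NXDOMAIN
dig ns3.example.com A    # NXDOMAIN, and so on, until the limit is hit
```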

The name SERVFAIL is from the RFC on the protocol, and registered, it is not made up. When unbound wants to print errors, it can print more detailed ones than that code. The other commenter is already using that with log-servfail: yes, which logs a short description of the failure. That is sometimes useful to cut to the chase, e.g. for DNSSEC failures it prints some reason for the lack of response. It is also possible to have descriptive errors in replies, with new options like ede: yes.

So other DNS clients work, and I wonder why, because unbound is not actually refusing the nxdomain for the queried name. Something is happening with other names, and you surmise it is the overload protection mechanism that causes unbound to give up and not resolve any further.

My guess is that there is some sort of issue around the antispam domains' commercialised response method, and Unbound's protection mechanism. I mean, somehow Unbound gets upset about the nonresponse that the email confidence provider has for its DNS servers, even though I guess you have access. It could perhaps be something with the nameservers, the DNS servers for the antispam domain name, and how they respond to the registered user. I guess this because unbound complains about the DNS server addresses, and the commercialised access method could also make that different from an ordinary DNS resolution. But perhaps my guess is wrong, so it is fine to discuss further and investigate this issue.

CompuRoot commented 4 months ago

There seems to be a misunderstanding.

Unfortunately it is.

The name SERVFAIL is from the RFC on the protocol, and registered, it is not made up.

According to the RFC:

It is not a failure if a remote server returns many NXDOMAINs. As long as it is functioning according to the protocol, it is still alive.

I cannot understand where the conclusion comes from that receiving many NXDOMAIN replies can be classified as SERVFAIL. NXDOMAIN is one of the standardized replies that a remote NS server returns, and it is THEIR AUTHORITY to decide how to answer. As long as it is working according to the protocol, it isn't SERVFAIL, regardless of how many NXDOMAINs it returns.

These then return nxdomains, and this is what the error is about: if that happens too many times, the algorithm gives up.

NXDOMAIN is not an error, it is an authoritative NS reply. "Too many NXDOMAINs" is not SERVFAIL; it is the remote server's reply and it is their authority to decide, not a DNS client's place to conclude "whether it's too many or not".

My guess is that there is some sort of issue around the antispam domains' commercialised response method

No, it isn't related. Other solutions like PowerDNS, KnotDNS, CoreDNS... work with the same providers without a problem. Only unbound interprets "many NXDOMAIN" as SERVFAIL and stops processing the query until it gets restarted.

wcawijngaards commented 4 months ago

Yes it is fine for the upstream server to return NXDOMAIN for the target query.

The change from nxdomain to servfail is for the lookup of the addresses of nameservers. This is where failed lookups make unbound give up and fail, and then return servfail. This is not nxdomain for the target query from the end client, but nxdomain for the addresses of nameservers that unbound needs to send queries to.

catap commented 4 months ago

I have a similar issue with a similar setup: a mail server with unbound which is used for RBL lookups.

Anyway, my setup is a quite small server with a few emails, and the issue that unbound replies with SERVFAIL happens a few times per day.

Because of that small traffic I was able to enable verbose logging to capture an example.

Environment: OpenBSD 7.5 with unbound 1.18.0 which is running with config:

server:
    interface: 127.0.0.1
    interface: ::1

    do-ip6: no

    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    access-control: ::0/0 refuse
    access-control: ::1 allow

    hide-identity: yes
    hide-version: yes

    auto-trust-anchor-file: "/var/unbound/db/root.key"

    val-log-level: 2
    verbosity: 5
    log-servfail: yes
    ede: yes

    aggressive-nsec: yes

    domain-insecure: "local."
    private-domain: "local."

    do-not-query-localhost: no

remote-control:
    control-enable: yes
    control-interface: /var/run/unbound.sock

stub-zone:
    name: "bl.local."
    stub-addr: 127.0.0.2

stub-zone:
    name: "wl.local."
    stub-addr: 127.0.0.2

I use a DNSBL filter for OpenSMTPD which uses gethostbyname_async to run multiple queries in parallel against unbound.

After enabling verbose logging in unbound I discovered the following logs from OpenSMTPD:

Apr 20 18:37:01 mx1 smtpd[3772]: 249926c85ae61fc4 smtp connected address=194.48.251.196 host=<unknown>
Apr 20 18:38:16 mx1 smtpd[6188]: dnsbl: 249926c85ae61fc4 DNS error 2 on dnsbl.sorbs.net
Apr 20 18:38:16 mx1 smtpd[3772]: 249926c85ae61fc4 smtp disconnected reason=disconnect

Here, someone (a spammer?) connected to smtpd, which triggers a series of DNS requests, and after 1 minute (I guess that is the default timeout for gethostbyname_async) error / response 2 happened. Let me quote the possible responses:

#define NETDB_INTERNAL  -1  /* see errno */
#define NETDB_SUCCESS   0   /* no problem */
#define HOST_NOT_FOUND  1 /* Authoritative Answer Host not found */
#define TRY_AGAIN   2 /* Non-Authoritative Host not found, or SERVERFAIL */
#define NO_RECOVERY 3 /* Non recoverable errors, FORMERR, REFUSED, NOTIMP */
#define NO_DATA     4 /* Valid name, no data record of requested type */
#define NO_ADDRESS  NO_DATA     /* no address */
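To make the mapping concrete, here is a minimal synchronous sketch of what the filter's check amounts to (the real filter uses gethostbyname_async; the lookup name is the reversed client address from the log above):

```c
#include <netdb.h>
#include <stdio.h>

int main(void) {
    /* 194.48.251.196 reversed and appended to the list zone */
    struct hostent *he = gethostbyname("196.251.48.194.dnsbl.sorbs.net");

    if (he != NULL) {
        /* an A record (normally 127.0.0.x) came back: the IP is listed */
        puts("listed");
    } else if (h_errno == HOST_NOT_FOUND) {
        /* NXDOMAIN from the list: the IP is not listed */
        puts("not listed (NXDOMAIN)");
    } else if (h_errno == TRY_AGAIN) {
        /* SERVFAIL or timeout from the resolver: this is the "DNS error 2"
         * that smtpd logs, and the listing status stays unknown */
        puts("temporary failure (SERVFAIL/timeout)");
    } else {
        puts("other resolver error");
    }
    return 0;
}
```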

OK, now unbound. I've extracted all of its logs from 18:37:00 until 18:38:59, which includes only the DNS requests associated with that client. As I said, I have little traffic. The log file is redacted: I replaced the API key for one list, but that is the only change I made. unbound.log

I tried to understand it, but I do not understand why it continues to repeat the same request over and over.

catap commented 4 months ago

I think I got it.

NXDOMAIN is used by the cache layer for the case when the cache has no entry.

So, when the DNSBL returns NXDOMAIN... unbound doesn't cache it, because it means "nothing in the cache", and tries another NS server for that zone, in the hope that it responds with something.

@wcawijngaards is it possible to treat NXDOMAIN from upstream as a final decision and not ask the next NS server?

wcawijngaards commented 4 months ago

It does treat nxdomain from upstream as a final answer and then returns that to the client. I am not sure what is going wrong for you. But just for getting an nxdomain, unbound should not be asking the next server.

catap commented 4 months ago

@wcawijngaards see attached log where unbound repeats the same request over and over.

wcawijngaards commented 4 months ago

The logs do not have SERVFAIL in them. There is nxdomain, and I see why it makes a second query on an nxdomain: it is performing qname minimisation. Some servers wrongly report NXDOMAIN for the shortened names, so it checks whether the full query name is also nxdomain. Perhaps qname-minimisation: no helps out. Which request is repeated?

catap commented 4 months ago

It repeats 196.251.48.194.dnsbl.sorbs.net.. I'll put qname-minimisation: no on that server. If the issue appears again, I will re-collect a verbose log.

wcawijngaards commented 4 months ago

Can you set verbosity: 4, because a lot of information was missing, or use a logfile instead of syslog, if syslog with high verbosity was used for the previous log, since that perhaps dropped a lot of output. Otherwise, it looks like sorbs.net is just not responding to queries, and unbound is trying several different servers, and at last one responds with NXDOMAIN, and that is the answer that unbound uses, which is in the log.

catap commented 4 months ago

Can you set verbosity: 4, because a lot of information was missing, or use a logfile instead of syslog, if syslog with high verbosity was used for the previous log, since that perhaps dropped a lot of output. Otherwise, it looks like sorbs.net is just not responding to queries, and unbound is trying several different servers, and at last one responds with NXDOMAIN, and that is the answer that unbound uses, which is in the log.

I have the same behaviour from different DNSBL servers; here are statistics for the last 24h:

   2 all.spamrats.com
  36 b.barracudacentral.org
  23 bl.spamcop.net
   2 combined.mail.abusix.zone
   8 dnsbl.sorbs.net
   2 list.dnswl.org

These numbers are from my small mail server with quite low traffic.

But I assume that all of them are using the same software to run the DNS: https://rbldnsd.io/

wcawijngaards commented 4 months ago

Are those servfails of some sort? Unbound has the option log-servfail: yes that prints one-line error responses, which you can inspect to see what is going on.

catap commented 4 months ago

qname-minimisation: no doesn't help. The settings look like:

    val-log-level: 2
    # verbosity: 5
    log-servfail: yes
    ede: yes
    qname-minimisation: no

and I just had in smtpd logs:

Apr 25 16:34:08 mx2 smtpd[21876]: dnsbl: 9618a2e89fa3d52f DNS error 2 on b.barracudacentral.org

that means that gethostbyname returns:

#define TRY_AGAIN   2 /* Non-Authoritative Host not found, or SERVERFAIL */

in parallel unbound log contains:

Apr 25 16:34:08 mx2 unbound: [59688:0] error: SERVFAIL <99.205.141.95.b.barracudacentral.org. A IN>: all servers for this domain failed, at zone b.barracudacentral.org. no server to query nameserver addresses not usable

As you asked, I switched to verbosity: 4.

wcawijngaards commented 4 months ago

But I would give disabling qname minimisation a fair chance to fix it; since the server specifically generates data, the intermediate labels of the lookup names could be trouble, as that is typically something qname minimisation touches, and that may be surprising behaviour.

wcawijngaards commented 4 months ago

I wonder what causes the nameservers to be not usable for the query; the verbosity: 4 logs can maybe show what the responses look like, e.g. why the response that is returned is not usable. The verbosity level 5 is good too, 4 or higher.

catap commented 4 months ago

The verbosity level 5 is good too, 4 or higher.

Verbosity 5 was attached; now I'm waiting until it happens again to capture it with verbosity 4.

wcawijngaards commented 4 months ago

Wait, that shortened log was verbosity 5? If so, could you use the log-to-file function. It seems that the syslog that may be in use drops a lot of stuff, like the debug category or lots of output.

catap commented 4 months ago

Wait, that shortened log was verbosity 5? If so, could you use the log-to-file function. It seems that the syslog that may be in use drops a lot of stuff, like the debug category or lots of output.

Can you share what I should put into the config to do it?

wcawijngaards commented 4 months ago

With something like logfile: "/root/unbound.log" it would log to a file instead of syslog, and stuff like debug output or lots of output does not get dropped.

catap commented 4 months ago

@wcawijngaards I switched to:

    val-log-level: 2
    verbosity: 5
    log-servfail: yes
    ede: yes
    qname-minimisation: no
    logfile: "/var/unbound/db/unbound.log"

catap commented 4 months ago

and I've got it. The mail server logs contain:

Apr 25 18:08:57 mx1 smtpd[71942]: 9f3ec9c731c1ab0e smtp connected address=89.113.156.156 host=<unknown>
Apr 25 18:10:12 mx1 smtpd[98148]: dnsbl: 9f3ec9c731c1ab0e DNS error 2 on b.barracudacentral.org
Apr 25 18:10:12 mx1 smtpd[71942]: 9f3ec9c731c1ab0e smtp disconnected reason=disconnect

So, here is the log between 1714061200 and 1714061500: unbound-range.log.gz

catap commented 4 months ago

The log contains UDP timeouts. What puzzles me is that I see it on both servers at random times, but only for a few zones, dozens of times.

And that is with one server in Germany and the second one in Finland.

Something very strange is going on here.

catap commented 4 months ago

OK, I've run tcpdump on one of the servers for about 30 minutes, and it showed that almost 10% of the queries get no reply.

How can I enforce TCP connections to upstream servers?

catap commented 4 months ago

Well... tcp-upstream: yes doesn't help because the DNSBL NS is UDP only.

Huh.

CompuRoot commented 4 months ago

How can I enforce TCP connections to upstream servers?

The problem is that unbound, after some hardcoded number of attempts, marks such remote servers as SERVFAIL. But RBL servers can be silent for any number of reasons: they are currently being abused/DDoSed, they have hit a resource limit because of the huge amount of requests they have to handle, the requester has reached its free/paid limits (per hour, per day), or something else in the middle has gone wrong.

Other DNS clients still continue to query, regardless of whether there was a temporary outage, but unbound marks such servers as SERVFAIL until it gets restarted.