gzzhangxinjie opened 3 years ago
When looking up the addresses for nameservers, unbound encounters too many NXDOMAIN responses for those lookups and stops, to avoid causing a denial of service. The domain has a long list of NS records, and those domains perhaps also have lists of NS records. All of those, or a lot of them, have no addresses and thus do not work. While trying to resolve the domain, unbound recursively looks up those nameservers, and the nameservers needed to look up those nameservers, and this takes too many resources.
The domain should not have such a long list of nameservers without addresses, or nameservers for the nameservers that have no addresses. There was a CVE a while ago about this resource consumption causing issues: too many queries, too much resource usage on the DNS server. This error stops the resource usage.
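The protection just described can be illustrated with a minimal sketch. This is not unbound's actual code; the function, the counter, and the limit value are all assumptions made up for illustration. The idea is a per-query counter of NXDOMAIN nameserver-address lookups that, once exceeded, fails the whole query:

```python
# Illustrative sketch only: names and the limit value are assumptions,
# not unbound's real implementation.
MAX_NS_NXDOMAIN = 5  # hypothetical cap on NXDOMAIN nameserver lookups


def resolve_with_ns_limit(ns_lookup_results):
    """Walk the nameserver-address lookups spawned for one client query.

    ns_lookup_results: iterable of ('ok', addr) or ('nxdomain', None)
    Returns the first usable address, or 'SERVFAIL' once too many
    nameserver lookups came back NXDOMAIN.
    """
    nxdomains = 0
    for status, addr in ns_lookup_results:
        if status == 'nxdomain':
            nxdomains += 1
            if nxdomains > MAX_NS_NXDOMAIN:
                # "exceeded the maximum nameserver nxdomains"
                return 'SERVFAIL'
            continue
        return addr
    return 'SERVFAIL'  # no usable nameserver address found
```

The key point, matching the thread, is that the counter tracks NXDOMAINs for the nameservers' own addresses, not NXDOMAIN answers to the client's target query.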
It seems, however, that "exceeded the maximum nameserver nxdomains" appears massively for otherwise normal domains after a temporary loss of network connectivity (unbound apparently does not distinguish here between an actual authoritative NXDOMAIN and temporary network unavailability) and marks all domains it tried to access while the network was down as NXDOMAIN.
This means the admin has to restart the unbound server manually after every WLAN/network connectivity issue, which makes it basically unusable as a caching DNS server.
Debian Bullseye GNU/Linux, unbound 1.13.2-1 (using forward-tls-upstream: yes, forward-first: yes, and a forward-addr: entry)
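For context, a setup like the one reported would presumably look roughly like this; the actual forward-addr value was not given in the report, so the upstream shown is only a placeholder:

```
# Hypothetical sketch of the reported setup; the real forward-addr
# value was omitted in the report.
server:
    tls-cert-bundle: /etc/ssl/certs/ca-certificates.crt

forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-first: yes
    forward-addr: 9.9.9.9@853#dns.quad9.net   # placeholder upstream
```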
I have observed the same issue. Today I updated my OpenWrt and afterwards I had to restart unbound inside my container...
Sep 08 07:20:26 unbound[837545:0] error: SERVFAIL <www.iana.org. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:26 unbound[837545:0] error: SERVFAIL <www.iana.org. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:28 unbound[837545:1] error: SERVFAIL <cocoapi.bmwgroup.com. AAAA IN>: exceeded the maximum number of sends
Sep 08 07:20:41 unbound[837545:1] error: SERVFAIL <nv2-namain.netatmo.net. A IN>: exceeded the maximum number of sends
Sep 08 07:20:41 unbound[837545:1] error: SERVFAIL <nv2-namain.netatmo.net. A IN>: exceeded the maximum number of sends
Sep 08 07:20:45 unbound[837545:0] error: SERVFAIL <diag.meethue.com. AAAA IN>: exceeded the maximum nameserver nxdomains
Sep 08 07:21:02 unbound[837545:0] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:02 unbound[837545:0] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:11 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:12 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:12 unbound[837545:1] info: validation failure <account.xiaomi.com. AAAA IN>: key for validation xiaomi.com. is marked as invalid
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable
Sep 08 07:21:20 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: exceeded the maximum number of sends
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: exceeded the maximum number of sends
Sep 08 07:21:20 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:26 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:27 unbound[837545:1] error: SERVFAIL <rt.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:37 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:37 unbound[837545:1] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:40 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. A IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:21:56 unbound[837545:0] error: SERVFAIL <iot.telemetry.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. upstream server timeout
Sep 08 07:22:06 unbound[837545:1] error: SERVFAIL <api.wall-box.com. AAAA IN>: all servers for this domain failed, at zone wall-box.com. no server to query nameserver addresses not usable
After the restart, everything was fine again...
I see the same problem here, and have to regularly restart unbound after my RSS reader tries to fetch all of my feeds. (I admit that I read lots of RSS feeds. This includes some feeds on no-longer-existing domains, but I keep them in my archive, and my RSS reader unfortunately still tries to fetch them.)
Is there any solution or workaround (like a config option to raise the query and/or nxdomain limit) for this issue?
Same here! Hoping for a fast resolution.
Is there a way to instruct unbound not to count NXDOMAINs when there is no connection to the upstream resolver pointed to by forward-zone: / forward-addr:, and to log the connection-related issue instead?
I could periodically check the logs and raise a SIGHUP for the unbound process when I encounter such a situation, but that is a hackish, ugly solution...
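That hackish "check logs, then reload" workaround could be scripted. As a minimal sketch (the marker strings are taken from the logs in this thread; the use of `unbound-control reload` as the SIGHUP-equivalent is an assumption about the deployment, not a tested recipe):

```python
#!/usr/bin/env python3
"""Watchdog sketch: reload unbound when the nxdomain-limit error
shows up in its logs. Assumes unbound-control is configured and
reachable; this is illustrative, not a hardened tool."""
import subprocess

# Error markers observed in this thread's logs.
MARKERS = (
    "exceeded the maximum nameserver nxdomains",
    "exceeded the maximum number of sends",
)


def needs_reload(log_lines):
    """Return True if any log line contains one of the error markers."""
    return any(m in line for line in log_lines for m in MARKERS)


def check_and_reload(log_lines):
    """Reload unbound (flushes state, rereads config) when triggered."""
    if needs_reload(log_lines):
        subprocess.run(["unbound-control", "reload"], check=True)
        return True
    return False
```

In practice this would be fed recent log lines (e.g. from journalctl) on a timer; it remains a workaround, not a fix.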
root@dl360g7:~# unbound -V
Version 1.16.2
Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --enable-tfo-server
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.5 5 Jul 2022
Linked modules: dns64 python subnetcache respip validator iterator
TCP Fastopen feature available
Oct 27 20:57:43 unbound[124347:1] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:1] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:0] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:43 unbound[124347:2] error: SERVFAIL <a.nel.cloudflare.com. A IN>: exceeded the maximum nameserver nxdomains
Oct 27 20:57:46 unbound[124347:3] info: server stats for thread 3: 7072191 queries, 6857794 answers from cache, 214397 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:46 unbound[124347:3] info: server stats for thread 3: requestlist max 90 avg 1.21422 exceeded 0 jostled 0
Oct 27 20:57:46 unbound[124347:3] info: average recursion processing time 1.206027 sec
Oct 27 20:57:46 unbound[124347:3] info: histogram of recursion processing times
Oct 27 20:57:46 unbound[124347:3] info: [25%]=0.0793683 median[50%]=0.174644 [75%]=1.0103
Oct 27 20:57:46 unbound[124347:3] info: lower(secs) upper(secs) recursions
Oct 27 20:57:46 unbound[124347:3] info: 0.000000 0.000001 1490
Oct 27 20:57:46 unbound[124347:3] info: 0.000016 0.000032 10
Oct 27 20:57:46 unbound[124347:3] info: 0.000032 0.000064 14
Oct 27 20:57:46 unbound[124347:3] info: 0.000064 0.000128 18
Oct 27 20:57:46 unbound[124347:3] info: 0.000128 0.000256 44
Oct 27 20:57:46 unbound[124347:3] info: 0.000256 0.000512 223
Oct 27 20:57:46 unbound[124347:3] info: 0.000512 0.001024 2985
Oct 27 20:57:46 unbound[124347:3] info: 0.001024 0.002048 17083
Oct 27 20:57:46 unbound[124347:3] info: 0.002048 0.004096 1885
Oct 27 20:57:46 unbound[124347:3] info: 0.004096 0.008192 1443
Oct 27 20:57:46 unbound[124347:3] info: 0.008192 0.016384 1238
Oct 27 20:57:46 unbound[124347:3] info: 0.016384 0.032768 856
Oct 27 20:57:46 unbound[124347:3] info: 0.032768 0.065536 16046
Oct 27 20:57:46 unbound[124347:3] info: 0.065536 0.131072 48631
Oct 27 20:57:46 unbound[124347:3] info: 0.131072 0.262144 45822
Oct 27 20:57:46 unbound[124347:3] info: 0.262144 0.524288 19299
Oct 27 20:57:46 unbound[124347:3] info: 0.524288 1.000000 3482
Oct 27 20:57:46 unbound[124347:3] info: 1.000000 2.000000 22201
Oct 27 20:57:46 unbound[124347:3] info: 2.000000 4.000000 13114
Oct 27 20:57:46 unbound[124347:3] info: 4.000000 8.000000 14462
Oct 27 20:57:46 unbound[124347:3] info: 8.000000 16.000000 3075
Oct 27 20:57:46 unbound[124347:3] info: 16.000000 32.000000 486
Oct 27 20:57:46 unbound[124347:3] info: 32.000000 64.000000 288
Oct 27 20:57:46 unbound[124347:3] info: 64.000000 128.000000 45
Oct 27 20:57:46 unbound[124347:3] info: 128.000000 256.000000 109
Oct 27 20:57:46 unbound[124347:3] info: 256.000000 512.000000 48
Oct 27 20:57:47 unbound[124347:0] info: server stats for thread 0: 6956869 queries, 6746181 answers from cache, 210688 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:0] info: server stats for thread 0: requestlist max 79 avg 1.18437 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:0] info: average recursion processing time 1.236167 sec
Oct 27 20:57:47 unbound[124347:0] info: histogram of recursion processing times
Oct 27 20:57:47 unbound[124347:0] info: [25%]=0.0791236 median[50%]=0.173051 [75%]=1.00855
Oct 27 20:57:47 unbound[124347:0] info: lower(secs) upper(secs) recursions
Oct 27 20:57:47 unbound[124347:0] info: 0.000000 0.000001 1417
Oct 27 20:57:47 unbound[124347:0] info: 0.000008 0.000016 1
Oct 27 20:57:47 unbound[124347:0] info: 0.000016 0.000032 11
Oct 27 20:57:47 unbound[124347:0] info: 0.000032 0.000064 14
Oct 27 20:57:47 unbound[124347:0] info: 0.000064 0.000128 22
Oct 27 20:57:47 unbound[124347:0] info: 0.000128 0.000256 44
Oct 27 20:57:47 unbound[124347:0] info: 0.000256 0.000512 241
Oct 27 20:57:47 unbound[124347:0] info: 0.000512 0.001024 2893
Oct 27 20:57:47 unbound[124347:0] info: 0.001024 0.002048 16893
Oct 27 20:57:47 unbound[124347:0] info: 0.002048 0.004096 1888
Oct 27 20:57:47 unbound[124347:0] info: 0.004096 0.008192 1447
Oct 27 20:57:47 unbound[124347:0] info: 0.008192 0.016384 1117
Oct 27 20:57:47 unbound[124347:0] info: 0.016384 0.032768 917
Oct 27 20:57:47 unbound[124347:0] info: 0.032768 0.065536 15749
Oct 27 20:57:47 unbound[124347:0] info: 0.065536 0.131072 48319
Oct 27 20:57:47 unbound[124347:0] info: 0.131072 0.262144 44871
Oct 27 20:57:47 unbound[124347:0] info: 0.262144 0.524288 18628
Oct 27 20:57:47 unbound[124347:0] info: 0.524288 1.000000 3358
Oct 27 20:57:47 unbound[124347:0] info: 1.000000 2.000000 21751
Oct 27 20:57:47 unbound[124347:0] info: 2.000000 4.000000 12796
Oct 27 20:57:47 unbound[124347:0] info: 4.000000 8.000000 14358
Oct 27 20:57:47 unbound[124347:0] info: 8.000000 16.000000 2962
Oct 27 20:57:47 unbound[124347:0] info: 16.000000 32.000000 538
Oct 27 20:57:47 unbound[124347:0] info: 32.000000 64.000000 230
Oct 27 20:57:47 unbound[124347:0] info: 64.000000 128.000000 44
Oct 27 20:57:47 unbound[124347:0] info: 128.000000 256.000000 105
Oct 27 20:57:47 unbound[124347:0] info: 256.000000 512.000000 74
Oct 27 20:57:47 unbound[124347:2] info: server stats for thread 2: 6998925 queries, 6787092 answers from cache, 211833 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:2] info: server stats for thread 2: requestlist max 90 avg 1.15098 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:2] info: average recursion processing time 1.439691 sec
Oct 27 20:57:47 unbound[124347:2] info: histogram of recursion processing times
Oct 27 20:57:47 unbound[124347:2] info: [25%]=0.0793255 median[50%]=0.173985 [75%]=1.0245
Oct 27 20:57:47 unbound[124347:2] info: lower(secs) upper(secs) recursions
Oct 27 20:57:47 unbound[124347:2] info: 0.000000 0.000001 1465
Oct 27 20:57:47 unbound[124347:2] info: 0.000016 0.000032 4
Oct 27 20:57:47 unbound[124347:2] info: 0.000032 0.000064 6
Oct 27 20:57:47 unbound[124347:2] info: 0.000064 0.000128 17
Oct 27 20:57:47 unbound[124347:2] info: 0.000128 0.000256 49
Oct 27 20:57:47 unbound[124347:2] info: 0.000256 0.000512 259
Oct 27 20:57:47 unbound[124347:2] info: 0.000512 0.001024 2941
Oct 27 20:57:47 unbound[124347:2] info: 0.001024 0.002048 16925
Oct 27 20:57:47 unbound[124347:2] info: 0.002048 0.004096 1879
Oct 27 20:57:47 unbound[124347:2] info: 0.004096 0.008192 1394
Oct 27 20:57:47 unbound[124347:2] info: 0.008192 0.016384 1295
Oct 27 20:57:47 unbound[124347:2] info: 0.016384 0.032768 821
Oct 27 20:57:47 unbound[124347:2] info: 0.032768 0.065536 15737
Oct 27 20:57:47 unbound[124347:2] info: 0.065536 0.131072 48316
Oct 27 20:57:47 unbound[124347:2] info: 0.131072 0.262144 45231
Oct 27 20:57:47 unbound[124347:2] info: 0.262144 0.524288 18607
Oct 27 20:57:47 unbound[124347:2] info: 0.524288 1.000000 3385
Oct 27 20:57:47 unbound[124347:2] info: 1.000000 2.000000 22190
Oct 27 20:57:47 unbound[124347:2] info: 2.000000 4.000000 12926
Oct 27 20:57:47 unbound[124347:2] info: 4.000000 8.000000 14583
Oct 27 20:57:47 unbound[124347:2] info: 8.000000 16.000000 3050
Oct 27 20:57:47 unbound[124347:2] info: 16.000000 32.000000 292
Oct 27 20:57:47 unbound[124347:2] info: 32.000000 64.000000 99
Oct 27 20:57:47 unbound[124347:2] info: 64.000000 128.000000 42
Oct 27 20:57:47 unbound[124347:2] info: 128.000000 256.000000 107
Oct 27 20:57:47 unbound[124347:2] info: 256.000000 512.000000 213
Oct 27 20:57:47 unbound[124347:1] info: server stats for thread 1: 7080760 queries, 6865904 answers from cache, 214856 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 27 20:57:47 unbound[124347:1] info: server stats for thread 1: requestlist max 78 avg 1.17997 exceeded 0 jostled 0
Oct 27 20:57:47 unbound[124347:1] info: average recursion processing time 1.252491 sec
Oct 27 20:57:47 unbound[124347:1] info: histogram of recursion processing times
Oct 27 20:57:47 unbound[124347:1] info: [25%]=0.0798659 median[50%]=0.17527 [75%]=1.00988
Oct 27 20:57:47 unbound[124347:1] info: lower(secs) upper(secs) recursions
Oct 27 20:57:47 unbound[124347:1] info: 0.000000 0.000001 1494
Oct 27 20:57:47 unbound[124347:1] info: 0.000016 0.000032 7
Oct 27 20:57:47 unbound[124347:1] info: 0.000032 0.000064 17
Oct 27 20:57:47 unbound[124347:1] info: 0.000064 0.000128 18
Oct 27 20:57:47 unbound[124347:1] info: 0.000128 0.000256 61
Oct 27 20:57:47 unbound[124347:1] info: 0.000256 0.000512 285
Oct 27 20:57:47 unbound[124347:1] info: 0.000512 0.001024 2994
Oct 27 20:57:47 unbound[124347:1] info: 0.001024 0.002048 17001
Oct 27 20:57:47 unbound[124347:1] info: 0.002048 0.004096 1834
Oct 27 20:57:47 unbound[124347:1] info: 0.004096 0.008192 1353
Oct 27 20:57:47 unbound[124347:1] info: 0.008192 0.016384 1271
Oct 27 20:57:47 unbound[124347:1] info: 0.016384 0.032768 835
Oct 27 20:57:47 unbound[124347:1] info: 0.032768 0.065536 15886
Oct 27 20:57:47 unbound[124347:1] info: 0.065536 0.131072 48743
Oct 27 20:57:47 unbound[124347:1] info: 0.131072 0.262144 46349
Oct 27 20:57:47 unbound[124347:1] info: 0.262144 0.524288 19202
Oct 27 20:57:47 unbound[124347:1] info: 0.524288 1.000000 3572
Oct 27 20:57:47 unbound[124347:1] info: 1.000000 2.000000 22273
Oct 27 20:57:47 unbound[124347:1] info: 2.000000 4.000000 13177
Oct 27 20:57:47 unbound[124347:1] info: 4.000000 8.000000 14508
Oct 27 20:57:47 unbound[124347:1] info: 8.000000 16.000000 3071
Oct 27 20:57:47 unbound[124347:1] info: 16.000000 32.000000 448
Oct 27 20:57:47 unbound[124347:1] info: 32.000000 64.000000 176
Oct 27 20:57:47 unbound[124347:1] info: 64.000000 128.000000 70
Oct 27 20:57:47 unbound[124347:1] info: 128.000000 256.000000 131
Oct 27 20:57:47 unbound[124347:1] info: 256.000000 512.000000 80
Oct 27 20:57:50 unbound[124347:0] error: SERVFAIL
Right now it counts a failure to look up the nameservers as a failure towards the maximum of failed nameserver lookups. Since those lookups are failing, it is reasonable that the query itself also fails. But the error message talks about nxdomains, and the error text is not about connection failures. It is visible in the code that the subquery resulted in SERVFAIL, but not that this was due to connectivity. And the SERVFAIL is not an NXDOMAIN, but it still has to count towards failing the query itself, because the spawned lookup failed.
When looking up the addresses for nameservers, unbound encounters too many NXDOMAIN responses for those lookups and stops, to avoid causing a denial of service.
Such a workflow might work for a setup where unbound serves regular clients, but in the email world, email servers query, on each incoming connection over port 25, multiple external antispam services based on the RBL DNS protocol, and there it is normal for those services to return NXDOMAIN as the result of testing an IP address, which tells the email server whether the IP in question is listed as an abuser/spammer. Those services (spamhaus.org, dnswl.org, barracudacentral.org, phishtank.rspamd.com, msbl.org, spamcop.net... and many others) require queries directly from the email server and won't work if unbound is in forwarding mode (to protect those services from abuse as well as DDoS). So unbound is not acting as an abuser but just following the RBL protocol, yet when unbound switches into "exceeded the maximum nameserver nxdomains" mode, it becomes useless.
Is there any option to turn such protection off?
That does not sound like a problem, because an NXDOMAIN for the target query is not the issue. The issue is an NXDOMAIN for the nameserver of the domain, for the address of the DNS nameserver. An NXDOMAIN for the target query, the antispam service lookup, is not an issue; that can be any number of queries. So the test of an IP address lookup should not be an issue.
Protection against the overload issue is not something that should be optional.
If you want to exercise control over it, a stub-zone with particular nameserver names, or nameserver addresses, listed in the Unbound configuration will make unbound use those nameservers. And if that lists existing nameservers, it changes the nameserver lookups.
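As a sketch of that suggestion, a stub-zone entry pinning the nameservers for one zone might look like this. The zone name is taken from the logs later in this thread; the address is a documentation-range placeholder, not a real nameserver:

```
# Hypothetical example; replace the zone name and address with the
# nameservers you actually want unbound to use for that zone.
stub-zone:
    name: "dwl.dnswl.org."
    stub-host: a.dwl-ns.dnswl.org.
    stub-addr: 192.0.2.53    # placeholder, not a real NS address
```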
So that should not be an issue, for the test of an IP address lookup.
Looks like an issue:
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <b.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <c.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <removed2avoidSubdomain-disclosure.dwl.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 07 18:14:24 unbound[24059:1] error: SERVFAIL <a.dwl-ns.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <d.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <e.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.spamhaus.com. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.dbl.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 00:23:27 unbound[4185:0] error: SERVFAIL <b.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <d.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <e.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <removed2avoidSubdomain-disclosure.dbl.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
Mar 08 01:10:55 unbound[4185:0] error: SERVFAIL <a.gns.spamhaus.org. A IN>: exceeded the maximum nameserver nxdomains
If you want to exercise control over it, a stub-zone with particular nameserver names
Some antispam services roll over their RBL domains to hide them from spammers, so it is unfortunately not always known upfront which domain(s) should be added to a stub-zone.
BTW, are the SERVFAILs below related to the same case?
Mar 01 10:20:20 unbound[8110:3] info: 127.0.0.1 1.0.0.127.bl.spameatingmonkey.net. A IN SERVFAIL 0.068697 0 62
Mar 01 10:20:25 unbound[8110:0] info: 127.0.0.1 1.0.0.127.rep.mailspike.net. A IN SERVFAIL 0.073550 0 56
Mar 01 10:20:57 unbound[8110:1] info: 127.0.0.1 removed2avoidSubdomain-disclosure.multi.surbl.org. A IN SERVFAIL 0.072373 0 62
Mar 01 10:21:11 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.uribl.spameatingmonkey.net. A IN SERVFAIL 0.070020 0 64
Mar 01 10:21:59 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.ebl.msbl.org. A IN SERVFAIL 0.263217 0 60
Mar 01 10:22:30 unbound[8110:0] info: 127.0.0.1 removed2avoidSubdomain-disclosure.email.rspamd.com. A IN SERVFAIL 0.071349 0 59
and SERVFAILs like this (XXX.XXX.XXX.XXX is a replacement for the real IP):
Mar 08 02:42:25 unbound[4185:3] error: SERVFAIL <XXX.XXX.XXX.XXX.bl.score.senderscore.com. A IN>: all servers for this domain failed, at zone bl.score.senderscore.com.
Mar 08 02:42:25 unbound[4185:3] info: 127.0.0.1 XXX.XXX.XXX.XXX.bl.score.senderscore.com. A IN SERVFAIL 0.365197 0 69
The only cure is to restart unbound, but on busy servers that isn't a solution at all.
No, the issue is not the IP address lookup in the antispam zone, where that answer is NXDOMAIN. The issue in the logs is that the nameservers themselves are NXDOMAIN; it is about the number of nameserver addresses that are NXDOMAIN.
It is awkward that a stub zone then does not work because the contents change; that removes the workaround option.
The issue looks like the actual zone configuration has a lot of NXDOMAIN nameserver addresses, and this triggers the protection measure that does not allow it. The large number of NXDOMAINs for nameserver addresses is rejected by the protection measure. This is certainly an unusual configuration.
So the question is about the protection measure and whether the nameserver configuration for the zone is reasonable. The resolver has to choose: protect against too many NXDOMAIN lookups, or resolve this zone. Security is most important, so that means the zone does not resolve. The zone was probably set up before this issue was reported, but now that a large number of NXDOMAINs is rejected, the setup is not something the resolver can accept.
It would be nice to be able, somehow, to figure out that this is a legitimate zone (humorously, just like the antispam address lookup is also trying to figure out), but I do not see an easy way to do that.
The issues from the logs are that the nameservers are NXDOMAIN.
But those aren't NXDOMAIN after unbound has been restarted. Those (as well as any others that are frequently queried) become NXDOMAIN after some period of unbound use (it might be an hour or less on busy days, while resolving might work fine for a day or two without a restart on less busy servers). As soon as unbound is restarted, it works as expected again... for a short period of time, recognizing the same domains without treating them as NXDOMAIN.
This is certainly an unusual configuration... the set up is not something the resolver can accept.
unbound on the email servers works in non-forwarding mode and uses root-hints: "/etc/unbound/named.cache", where named.cache is updated once per month (if there are changes) from https://www.internic.net/domain/named.cache. There are no "privacy/countries" blocklists loaded, just a few records (3 to 5) added to local-zone: as always_nxdomain. The server itself where unbound is running is static:
local-zone: "mx.domain.com." static
local-data: "mx.domain.com. 900 IN A 1.2.3.4"
local-data-ptr: "1.2.3.4 900 mx.domain.com."
0x20 is disabled, and DNSSEC is disabled as well with module-config: "iterator" (that helps a little but didn't resolve the resolving issue).
There is nothing special in the configuration that might trigger SERVFAIL. Older versions (1.4.x) of unbound in the same subnet are working fine, but 1.9.0 (on Debian 10) is constantly failing. Also, another group of servers that uses PowerDNS isn't affected by such issues.
Security is most important, so that means the zone does not resolve.
The funny thing is that some of these email servers are used for security notifications but fail to send emails because they can't resolve the MX of the receiving servers; so it turned into "security for DNS" while failing to report actual security incidents. BTW, the receiver's MX records are on registrar-servers.com, where the TTL is set to 5 minutes; that works pretty stably and resolves without any issues if one queries the registrar-servers.com NS servers directly, while unbound chokes if it gets too many queries for the same zone.
It is kind of the same situation with unbound in pfSense when used in a heavily loaded environment; the only cure then is to switch unbound into forwarding mode, after which "the issue" disappears, but that is not possible with email servers, since RBLs won't accept queries from the ISP's DNS, nor from 8.8.8.8 and "friends".
From the list of responses, I see that the state of the answers changes: the names change from working to not working. I think it is strange that the older unbound version works but the newer one does not. Perhaps qname minimisation, an option that is on by default since 1.8 or so, is causing the servers to give wrong answers. Specifically, a higher NXDOMAIN answer is then turned into an NXDOMAIN for other names. That would mean the servers are not compliant with the standards spec that says they should not return NXDOMAIN for those nodes.
Qname minimisation can be turned off with qname-minimisation: no. If the qname-minimisation-strict option was turned on, perhaps turn that off too. If this is the problem, it would explain how nameservers that get affected would, after some period of working, have an NXDOMAIN for their address, and then NXDOMAIN for the address would also be returned for other queries, once the cache has received certain higher-level NXDOMAINs.
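In unbound.conf terms, the suggestion amounts to the following server: options (a sketch of this comment's advice; the strict option only matters if it was explicitly enabled before):

```
server:
    qname-minimisation: no
    # only relevant if it had been explicitly turned on:
    qname-minimisation-strict: no
```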
Thank you for the suggestion. Yes, qname-minimisation is turned on on those servers. I will turn it off and will update here if it resolves the issue.
Unfortunately nothing helps. unbound cannot be used on email servers where antispam solutions utilize well-known online services that use DNS to answer whether an IP is registered in a spam database or not, answering with NXDOMAIN for clean IPs. To eliminate possible network disruption during the tests, we queried in parallel using another resolver, which confirms that the remote DNS is working properly.
So, perhaps if harden-below-nxdomain is turned off in the config, the nameservers do not become NXDOMAIN later on. If that works as a solution, it would indicate that the servers serving the nameserver records are sending NXDOMAIN for intermediate labels, a protocol failure, perhaps caused by a custom script or so. Or the entire domain does not exist; I mean, the normal internet has intermediate NXDOMAINs above these names, and the names are specifically entered by configuration. In that case there could be more recent fixes, but turning off harden-below-nxdomain could also alleviate that problem.
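In config form, that suggestion is simply (a sketch; this disables the below-NXDOMAIN hardening that is on by default in recent versions):

```
server:
    harden-below-nxdomain: no
```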
Same issue here, even after turning off harden-below-nxdomain and setting qname-minimisation to no. The problem still exists; the unbound service needs to be restarted to restore functionality.
I'm using unbound 1.19.
Error logs
Apr 09 18:09:14 unbound[26254:3] reply: 127.0.0.1 ipac.ctnsnet.com. A IN SERVFAIL 0.002092 0 45
Apr 09 18:09:14 unbound[26254:15] error: SERVFAIL <grow.nowcoder.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:15] reply: 127.0.0.1 grow.nowcoder.com. A IN SERVFAIL 0.001150 0 46
Apr 09 18:09:14 unbound[26254:18] error: SERVFAIL <hb-api.omnitagjs.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:18] reply: 127.0.0.1 hb-api.omnitagjs.com. A IN SERVFAIL 0.001932 0 49
Apr 09 18:09:14 unbound[26254:6] error: SERVFAIL <video.goofish.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:6] reply: 127.0.0.1 video.goofish.com. A IN SERVFAIL 0.002066 0 46
Apr 09 18:09:14 unbound[26254:13] error: SERVFAIL <s.company-target.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:13] reply: 127.0.0.1 s.company-target.com. A IN SERVFAIL 0.001939 0 49
Apr 09 18:09:14 unbound[26254:1a] error: SERVFAIL <kgnop3.kugou.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:1a] reply: 127.0.0.1 kgnop3.kugou.com. A IN SERVFAIL 0.002232 0 45
Apr 09 18:09:14 unbound[26254:3] error: SERVFAIL <mon11-misc-lf.fqnovel.com. A IN>: exceeded the maximum number of sends
Apr 09 18:09:14 unbound[26254:3] reply: 127.0.0.1 mon11-misc-lf.fqnovel.com. A IN SERVFAIL 0.001992 0 54
Are there any other solutions?
When I set the verbosity to 3, I found the debug logs below:
Apr 09 18:57:37 unbound[9338:c] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:c] error: SERVFAIL <pull-flv-l11-cny.douyincdn.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:8] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:8] error: SERVFAIL <res.wx.qq.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:d] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:d] error: SERVFAIL <bd-adaptive-pull.video-voip.com.a.bcelive.com. A IN>: exceeded the maximum number of sends
Apr 09 18:57:37 unbound[9338:c] debug: request has exceeded the maximum number of sends with 33
Apr 09 18:57:37 unbound[9338:c] error: SERVFAIL <pull-hs-spe-f5.douyinliving.com. A IN>: exceeded the maximum number of sends
Output with verbosity 4 may be even more useful, and in particular would show where those 33 packets are sent, and why they get no answer; that is perhaps visible in the logs. If you have IPv6 enabled but no IPv6 connectivity, perhaps set do-ip6: no; that could save up to half of those attempts. The limit itself is configurable; max-sent-count: 32 is the default.
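Put together, both suggestions as an unbound.conf sketch (values are illustrative; adjust for your network):

```
server:
    # Skip IPv6 upstream queries if the host has no IPv6 connectivity.
    do-ip6: no
    # Upper bound on packets sent per query; 32 is the default.
    max-sent-count: 32
```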
@wcawijngaards When I set the verbosity to 4, I get the verbose log below:
Apr 09 20:28:35 unbound[16386:5] info: average recursion processing time 12.093231 sec
Apr 09 20:28:35 unbound[16386:5] info: histogram of recursion processing times
Apr 09 20:28:35 unbound[16386:5] info: [25%]=2.03448 median[50%]=6.94737 [75%]=21.4737
Apr 09 20:28:35 unbound[16386:5] info: lower(secs) upper(secs) recursions
Apr 09 20:28:35 unbound[16386:5] info: 0.000000 0.000001 1
Apr 09 20:28:35 unbound[16386:5] info: 0.008192 0.016384 1
Apr 09 20:28:35 unbound[16386:5] info: 0.016384 0.032768 1
Apr 09 20:28:35 unbound[16386:5] info: 0.032768 0.065536 1
Apr 09 20:28:35 unbound[16386:5] info: 0.065536 0.131072 3
Apr 09 20:28:35 unbound[16386:5] info: 0.131072 0.262144 6
Apr 09 20:28:35 unbound[16386:5] info: 0.262144 0.524288 5
Apr 09 20:28:35 unbound[16386:5] info: 0.524288 1.000000 7
Apr 09 20:28:35 unbound[16386:5] info: 1.000000 2.000000 17
Apr 09 20:28:35 unbound[16386:5] info: 2.000000 4.000000 29
Apr 09 20:28:35 unbound[16386:5] info: 4.000000 8.000000 19
Apr 09 20:28:35 unbound[16386:5] info: 8.000000 16.000000 18
/exceed
Apr 09 20:28:37 unbound[16386:9] debug: svcd callbacks end
Apr 09 20:28:37 unbound[16386:9] debug: serviced_delete
Apr 09 20:28:37 unbound[16386:9] debug: serviced send timer
Apr 09 20:28:37 unbound[16386:9] debug: EDNS lookup known=0 vs=0
Apr 09 20:28:37 unbound[16386:9] debug: serviced query UDP timeout=376 msec
Apr 09 20:28:37 unbound[16386:9] debug: inserted new pending reply id=c516
Apr 09 20:28:37 unbound[16386:9] debug: opened UDP if=0 port=8726
Apr 09 20:28:37 unbound[16386:9] error: udp connect failed: Network is unreachable for 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:37 unbound[16386:9] debug: svcd callbacks start
Apr 09 20:28:37 unbound[16386:9] debug: worker svcd callback for qstate 0x1a4b240
Apr 09 20:28:37 unbound[16386:9] debug: mesh_run: start
Apr 09 20:28:37 unbound[16386:9] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:37 unbound[16386:9] info: iterator operate: query gs-loc.apple.com. A IN
Apr 09 20:28:37 unbound[16386:9] info: iterator operate: chased to bluedot.is.autonavi.com.gds.alibabadns.com. A IN
Apr 09 20:28:37 unbound[16386:9] debug: process_response: new external response event
Apr 09 20:28:37 unbound[16386:9] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:37 unbound[16386:9] debug: query response was timeout
Apr 09 20:28:37 unbound[16386:9] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:37 unbound[16386:9] info: processQueryTargets: gs-loc.apple.com. A IN
Apr 09 20:28:37 unbound[16386:9] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 65
Apr 09 20:28:37 unbound[16386:9] debug: request has exceeded the maximum number of sends with 65
Apr 09 20:28:37 unbound[16386:9] debug: return error response SERVFAIL
Apr 09 20:28:37 unbound[16386:9] debug: mesh_run: iterator module exit state is module_finished
Apr 09 20:28:37 unbound[16386:9] error: SERVFAIL <gs-loc.apple.com. A IN>: exceeded the maximum number of sends
Apr 09 20:28:37 unbound[16386:9] debug: query took 2.577494 sec
Apr 09 20:28:37 unbound[16386:9] reply: 127.0.0.1 gs-loc.apple.com. A IN SERVFAIL 2.577494 0 45
Apr 09 20:28:37 unbound[16386:9] debug: query took 7.657016 sec
Apr 09 20:28:37 unbound[16386:9] reply: 127.0.0.1 gs-loc.apple.com. A IN SERVFAIL 7.657016 0 45
Apr 09 20:28:37 unbound[16386:9] debug: query took 12.572124 sec
Apr 09 20:28:37 unbound[16386:a] info: 1vRDCD mod0 ns4.zdns.google. A IN
Apr 09 20:28:37 unbound[16386:a] info: 2vRDCD mod0 ns1.zdns.google. AAAA IN
Apr 09 20:28:37 unbound[16386:0] debug: svcd callbacks start
Apr 09 20:28:37 unbound[16386:0] debug: worker svcd callback for qstate 0x2efa8e0
Apr 09 20:28:37 unbound[16386:0] debug: mesh_run: start
Apr 09 20:28:37 unbound[16386:0] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:37 unbound[16386:0] info: iterator operate: query ns3.yahoo.com. AAAA IN
Apr 09 20:28:37 unbound[16386:0] debug: process_response: new external response event
Apr 09 20:28:37 unbound[16386:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:37 unbound[16386:5] info: k.gtld-servers.net. * A AAAA
Apr 09 20:28:37 unbound[16386:5] info: l.gtld-servers.net. * A AAAA
Apr 09 20:28:37 unbound[16386:5] info: h.gtld-servers.net. * A AAAA
/exceed
Apr 09 20:28:39 unbound[16386:d] info: 0vRDCDd mod0 ns-383.awsdns-47.com. A IN
Apr 09 20:28:39 unbound[16386:1] info: [25%]=6.33333 median[50%]=97.1294 [75%]=156.055
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:2] error: udp connect failed: Network is unreachable for 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:3] info: sending query: tm1.edgedns-tm.info. A IN
Apr 09 20:28:39 unbound[16386:3] debug: sending to target: <info.> 2001:500:1a::1#53
Apr 09 20:28:39 unbound[16386:3] debug: dnssec status: not expected
Apr 09 20:28:39 unbound[16386:3] debug: mesh_run: iterator module exit state is module_wait_reply
Apr 09 20:28:39 unbound[16386:3] info: average recursion processing time 56.002378 sec
Apr 09 20:28:39 unbound[16386:3] info: histogram of recursion processing times
Apr 09 20:28:39 unbound[16386:e] debug: rtt=376
Apr 09 20:28:39 unbound[16386:e] debug: servselect ip6 2001:503:39c1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:e] debug: rtt=376
Apr 09 20:28:39 unbound[16386:e] debug: servselect ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:e] debug: rtt=376
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.35.51.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:503:a83e::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.5.6.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:503:eea3::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.42.93.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.54.112.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.41.162.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:503:d2d::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.52.178.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:503:83eb::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.26.92.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: ip6 2001:500:856e::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:4] debug: ip4 192.31.80.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:4] debug: rpz: iterator module callback: have_rpz=0
Apr 09 20:28:39 unbound[16386:6] debug: rtt=376
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:6] debug: rtt=376
Apr 09 20:28:39 unbound[16386:2] debug: svcd callbacks start
Apr 09 20:28:39 unbound[16386:2] debug: worker svcd callback for qstate 0x10246a0
Apr 09 20:28:39 unbound[16386:2] debug: mesh_run: start
Apr 09 20:28:39 unbound[16386:2] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:39 unbound[16386:2] info: iterator operate: query ssl.gstatic.com. A IN
Apr 09 20:28:39 unbound[16386:3] info: [25%]=8.66667 median[50%]=55.6364 [75%]=92.7619
Apr 09 20:28:39 unbound[16386:5] error: udp connect failed: Network is unreachable for 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:5] debug: svcd callbacks start
Apr 09 20:28:39 unbound[16386:5] debug: worker svcd callback for qstate 0x14658f0
Apr 09 20:28:39 unbound[16386:5] debug: mesh_run: start
Apr 09 20:28:39 unbound[16386:5] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Apr 09 20:28:39 unbound[16386:5] info: iterator operate: query api.bilibili.com. A IN
Apr 09 20:28:39 unbound[16386:5] info: iterator operate: chased to a.w.bilicdn1.com. A IN
Apr 09 20:28:39 unbound[16386:5] debug: process_response: new external response event
Apr 09 20:28:39 unbound[16386:5] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:39 unbound[16386:5] debug: query response was timeout
Apr 09 20:28:39 unbound[16386:5] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:39 unbound[16386:5] info: processQueryTargets: api.bilibili.com. A IN
Apr 09 20:28:39 unbound[16386:5] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 65
Apr 09 20:28:39 unbound[16386:5] debug: request has exceeded the maximum number of sends with 65
Apr 09 20:28:39 unbound[16386:5] debug: return error response SERVFAIL
Apr 09 20:28:39 unbound[16386:5] debug: mesh_run: iterator module exit state is module_finished
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug: rtt=1034
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug: rtt=946
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.48.79.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug: rtt=1539
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug: rtt=1409
Apr 09 20:28:39 unbound[16386:9] debug: servselect ip4 192.12.94.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:9] debug: rtt=900
Apr 09 20:28:39 unbound[16386:9] debug: selrtt 376
Apr 09 20:28:39 unbound[16386:9] info: sending query: drive.wpsdns.com. A IN
Apr 09 20:28:39 unbound[16386:9] debug: sending to target: <com.> 2001:503:39c1::30#53
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:502:7094::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.48.79.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:39c1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.43.172.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:d414::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.35.51.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:a83e::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.5.6.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:eea3::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.42.93.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:502:8cc::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.54.112.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:500:d937::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.41.162.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:d2d::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.52.178.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:503:83eb::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.26.92.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: ip6 2001:500:856e::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: ip4 192.31.80.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:f] debug: rpz: iterator module callback: have_rpz=0
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:502:1ca1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: rtt=376
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:503:231d::2:30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:f] debug: rtt=376
Apr 09 20:28:39 unbound[16386:f] debug: servselect ip6 2001:502:7094::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:3] info: 0.000000 0.000001 2
Apr 09 20:28:39 unbound[16386:3] info: 0.001024 0.002048 1
Apr 09 20:28:39 unbound[16386:3] info: 0.002048 0.004096 1
Apr 09 20:28:39 unbound[16386:a] info: h.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: g.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: a.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: f.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: i.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: m.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: j.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: b.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:a] info: e.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:1] info: lower(secs) upper(secs) recursions
Apr 09 20:28:39 unbound[16386:a] debug: ip6 2001:502:1ca1::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:0] info: sending query: ns3.yahoo.com. AAAA IN
Apr 09 20:28:39 unbound[16386:0] debug: sending to target: <com.> 2001:503:39c1::30#53
Apr 09 20:28:39 unbound[16386:0] debug: dnssec status: not expected
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip6 2001:501:b1f9::30 port 53 (len 28)
Apr 09 20:28:39 unbound[16386:6] debug: rtt=376
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip4 192.33.14.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:6] debug: rtt=1363
Apr 09 20:28:39 unbound[16386:6] debug: servselect ip4 192.55.83.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:2] debug: iter_handle processing q with state QUERY RESPONSE STATE
Apr 09 20:28:39 unbound[16386:2] debug: query response was timeout
Apr 09 20:28:39 unbound[16386:2] debug: iter_handle processing q with state QUERY TARGETS STATE
Apr 09 20:28:39 unbound[16386:2] info: processQueryTargets: ssl.gstatic.com. A IN
Apr 09 20:28:39 unbound[16386:2] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 8
Apr 09 20:28:39 unbound[16386:2] info: DelegationPoint<com.>: 13 names (0 missing), 26 addrs (26 result, 0 avail) cacheNS
Apr 09 20:28:39 unbound[16386:2] info: d.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:2] info: c.gtld-servers.net. * A AAAA
Apr 09 20:28:39 unbound[16386:c] info: 14vRDCDd mod0 h5-f5gtm01-lsnr02.eset.com. AAAA IN
Apr 09 20:28:39 unbound[16386:c] info: 15RDdc mod0 rep dmd.metaservices.microsoft.com. A IN
Apr 09 20:28:39 unbound[16386:d] info: 1vRDCDd mod0 ns-383.awsdns-47.com. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 2vRDCDd mod0 n6dsce9.akamaiedge.net. AAAA IN
Apr 09 20:28:39 unbound[16386:a] debug: ip4 192.12.94.30 port 53 (len 16)
Apr 09 20:28:39 unbound[16386:d] info: 3vRDCDd mod0 n1dscgi3.akamai.net. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 4vRDCDd mod0 n6dscgi3.akamai.net. AAAA IN
Apr 09 20:28:39 unbound[16386:d] info: 5RDd mod0 rep w2.kskwai.com. A IN
Apr 09 20:28:39 unbound[16386:0] debug: mesh_run: iterator module exit state is module_wait_reply
Apr 09 20:28:39 unbound[16386:0] info: mesh_run: end 12 recursion states (6 with reply, 0 detached), 13 waiting replies, 208 recursion replies sent, 0 replies dropped, 0 states jostled out
I adjusted max-sent-count to 64, but the problem still exists. From the log it can be seen that "208 recursion replies sent". I don't know if this is related to max-sent-count.
I tried setting max-sent-count to 256 and ran another test. The probability of failure decreased significantly, with only a few cases where the domain name resolution results could not be obtained, and those few unresolvable cases are acceptable. So increasing max-sent-count does alleviate the problem I described above. I'm not sure whether this is best practice, but from a user-experience perspective it does work better.
Are there any other solutions?
Yes, other resolvers.
Unfortunately unbound is unusable on servers and border gateways due to the hard-coded limit on the number of sends. Because of this, we removed it from all our servers.
The interesting failure in the logs is udp connect failed: Network is unreachable for 2001:503:d414::30. It seems that udp-connect: no could be a solution. Or, if there are IPv6 routing problems, perhaps do-ip6: no is a solution.
So, what makes it fail to work on servers and border gateways? The maximum number of sends is configurable nowadays, and you can increase it if you like, with max-sent-count:. But why would that fail so hard, I mean 30 packets is a lot, most DNS queries should work in only several. The counter is the resolver trying to send a packet, and somehow that fails a lot.
But why would that fail so hard, I mean 30 packets is a lot, most DNS queries should work in only several.
If unbound is supposed to work on a single office computer serving a regular user, where all it does is map name => IP, then yes, you're right. But for servers or specific gateways, overriding remote NS replies breaks a protocol and workflow that expects a huge amount of NXDOMAIN.
I already tried to explain it. As of now, instead of continuing to deliver the NS server's exact reply, such as NXDOMAIN, to the DNS consumer, unbound tries to play the good client that "protects the world from the bad guys" by avoiding abuse of the remote NS, and after a certain number of queries converts NXDOMAIN into SERVFAIL, which isn't true. In fact, on email servers, this is basically the same as a MITM: it prevents the email server from classifying connecting domains as spammers or not, and in the end unbound helps spammers sneak their dirty messages through. unbound as a client is trying to do the job of the servers/infrastructure of those who exposed the NS service and have to protect themselves from DDoS on their own.
Overriding NXDOMAIN with SERVFAIL breaks antispam services that have worked over DNS for decades, where NXDOMAIN is just a flag that the domain in question either does not exist in their spam list, or resolves to some 127.x.x.x reply classifying it as a spammer; unbound simply prevents such services from doing their job.
max-sent-count obviously is a pretty good choice for public services (hospitals, libraries, campuses...) that need protection against malicious users/apps abusing NS servers, but it can't help in the cases described above: email servers (of science/programming facilities) that have legitimate reasons to query a remote NS and expect exact replies from it, instead of a decision made by an intermediate caching forwarder.
I'm sorry upfront for the long and expressive post; its only intent is to help in understanding the issue.
No, it is not that 30 NXDOMAINs are an issue, or that unbound refuses NXDOMAIN responses. What is also awkward is that the original poster, the person who had a problem just before you commented, and you are talking about different things. The original poster talks about the error of exceeding the nameserver nxdomains, the previous poster has trouble sending packets and hits limits on the number of sent packets, and it seems you would like antispam domains to resolve correctly.
Unbound of course wants to store the NXDOMAIN responses and return them to the querier. It also wants to reply with the 127.x responses that antispam domains use. For that it performs lookups to get that data.
The maximum nameserver nxdomains exceeded error is an error from the recursion, where a nameserver lookup (not the target domain name, but a nameserver, or a nameserver for a nameserver, recursively) has an NXDOMAIN answer. This is unexpected, since nameservers, by their purpose, have addresses. It is also not the target query that the client makes. A limit on the number of failed nameservers was set for overload prevention, and that is what the nameserver nxdomain exceeded error is about. It is not about whether the target query itself is NXDOMAIN or 127.x, just about its nameservers.
So unbound is copying the nameserver reply to the client for that target query. And even for the nameserver lookups themselves it would store that the answer was NXDOMAIN, for example. But it stops after too much work is expended.
Both the max number of nameserver nxdomains and max sent count are maximums on amount of work expended. They are set to be reasonable, e.g. normal domains should resolve. Antispam domains should also work just fine.
If the limits are exceeded unbound returns SERVFAIL: the resolver has failed to look up the name. It is not trying to change the answer, but reporting that it could not do the lookup.
So maybe we can figure out, since the complaint is that something is wrong, what exactly is going wrong? It must be different things, it seems to me, because of the different errors; perhaps these are different issues being talked about. So increase verbosity and perhaps it can reveal what happens those countless times. The inability to send packets would stop any server from functioning, and so would getting no address from nameserver lookups. Possibly there are also bugs in the code that prevent proper functioning or create these failures, but I have not been able to find any (in this ticket report).
The maximum nameserver nxdomains exceeded error is an error from the recursion,
No. It doesn't work like a regular DNS query. To prevent abuse and to commercialize the service, most antispam databases that use the DNS mechanism require you to query them directly, from your own email server's specific static IP. If one used an intermediate recursor, like 8.8.8.8 or an ISP's DNS, those queries would be rejected by the antispam database providers.
I don't know how to explain it more clearly: an NXDOMAIN reply is NOT an error, but a flag indicating that the queried domain does NOT EXIST in the spam list. It isn't related to common DNS name<=>IP resolution.
But it stops after too much work is expended. Both the max number of nameserver nxdomains and max sent count are maximums on amount of work expended
And that's why unbound alone cannot be used on email servers. unbound decides when it will stop working, even though the workflow described above EXPECTS to get a huge amount of NXDOMAIN from the same specific DNS servers, even with the same query sent to them many times. Remote servers in such cases aren't abused; it is just how it works: the remote DNS server replies with NXDOMAIN when the queried domain isn't listed as a spammer, but that doesn't mean the domain won't be banned in the next hour if it starts sending spam, and that's why it is OK to query the same domain many times, often, even if the answer was NXDOMAIN for the last hundreds or thousands of queries.
If the limits are exceeded unbound returns SERVFAIL
UNBOUNDFAIL would be a more appropriate response, not SERVFAIL.
SERVFAIL should mean that unbound can't talk to a server anymore because the SERVER really FAILED, not because a DNS client decided it got tired, interpreted NXDOMAIN in its own way, and refused to do its job because some hard-coded, unbound-only limit was reached.
So maybe we can figure out, since the complaint is that something is wrong, what exactly is going wrong?
What is wrong: a DNS client must simply do its job without assuming that multiple NXDOMAIN answers are equal to SERVFAIL. unbound shouldn't act as a judge either, assuming the remote server is being abused and returning the false DNS reply SERVFAIL to the querier even when the server is still alive and replying with proper DNS answers, even if those answers are NXDOMAIN a million times for the same query over a short period.
So increase verbosity and perhaps it can reveal what happens those countless times.
There are no errors between the server and unbound; verbosity won't help. The error is in unbound, which invented a new way to substitute SERVFAIL for NXDOMAIN after some hard-coded measuring unique to unbound. This is a logical error that pops up after some number of queries, when unbound stops processing queries from its clients and returns the false reply SERVFAIL, while other DNS clients, in comparison, successfully continue to work.
Querying antispam databases over DNS, the querier expects to get a huge amount of NXDOMAIN from a specific DNS server utilizing the DNS protocol. DNS isn't just for resolving name/IP, but also for other key/value pairs related to domains.
I hope that someone might look at how MTAs and antispam databases that utilize DNS have worked for decades when making decisions on email reception. Below are just a few such providers you can check:
zen.spamhaus.org
b.barracudacentral.org
bl.spamcop.net
bl.spameatingmonkey.net
dnsbl.sorbs.net
psbl.surriel.com
bl.mailspike.net
list.dnswl.org
and there are tens of other similar providers.
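For readers unfamiliar with the mechanism these lists use: a DNSBL client reverses the octets of the connecting IP, prepends them to the list zone, and does an ordinary A lookup; NXDOMAIN means "not listed", while a 127.0.0.x answer means "listed". A minimal sketch of the name construction (the helper name and the example IP are illustrative):

```python
def dnsbl_qname(ip: str, zone: str) -> str:
    """Build the query name for an IPv4 DNSBL lookup:
    octets reversed, then the list zone appended."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

# A real check would resolve this name with an A query;
# NXDOMAIN = not listed, 127.0.0.x = listed as a spam source.
print(dnsbl_qname("194.48.251.196", "zen.spamhaus.org"))
# -> 196.251.48.194.zen.spamhaus.org
```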
The same technique is also used in corporate environments at the DNS level, where replies from specific upstream DNS providers give NXDOMAIN a different meaning: it may indicate that a name is or is not listed in some specific database, for some specific reason, for a specific short time, yet the name can be resolved to some specific reply when needed.
Yes, of course Unbound should be able to resolve antispam domains, hopefully without bugs.
There seems to be a misunderstanding. The NXDOMAIN from the error versus other NXDOMAINs: the NXDOMAIN for an antispam lookup is not the topic of the error. There can be any number of those NXDOMAINs. It is another type of lookup, the lookup for the address of a DNS server, made by unbound itself, that receives an unexpected NXDOMAIN. Perhaps I can illustrate it with a short example. Suppose there is a query for www.example.com; that could be NXDOMAIN. That is fine and does not cause the error printout, and there can be any number of those NXDOMAIN replies. To resolve example.com names, Unbound may need to look up DNS servers, and it decides to look up the IP addresses of ns1.example.com, ns2.example.com, ns3.example.com ... These then return nxdomains, and this is what the error is about: if that happens too many times, the algorithm gives up, in an attempt to stop certain overload situations where there could be infinitely many such lookups.
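The distinction can be shown with a toy model of the overload-protection idea. This is not Unbound's actual code; the threshold, function names, and data shapes are purely illustrative. The point is that only failed *nameserver address* lookups count toward the limit, never the target query's own NXDOMAIN:

```python
MAX_NS_NXDOMAIN = 5  # illustrative threshold, not Unbound's real value

def resolve(target: str, ns_address_answers: dict) -> str:
    """Toy resolver: ns_address_answers maps each nameserver name to
    either an IP string or "NXDOMAIN" for its own address lookup."""
    failed = 0
    for ns, answer in ns_address_answers.items():
        if answer == "NXDOMAIN":
            failed += 1
            if failed > MAX_NS_NXDOMAIN:
                # Give up: too many nameserver address lookups failed.
                # This is the situation behind the error in this issue.
                return "SERVFAIL"
        else:
            # A usable nameserver address; the target query proceeds,
            # and its answer (even NXDOMAIN) is passed to the client.
            return f"query {target} via {ns} at {answer}"
    return "SERVFAIL"  # no usable nameserver address found at all
```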
The name SERVFAIL is from the RFC on the protocol, and registered; it is not made up. When unbound wants to print errors, it can print more detailed ones than that code. The other commenter is already using that with log-servfail: yes, which logs a short description of the failure. Sometimes it is useful to cut to the chase, for example for DNSSEC failures it prints a reason for the lack of response. It is also possible to have descriptive errors in replies, with newer options like ede: yes.
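For completeness, both diagnostic options as an unbound.conf sketch:

```
server:
    # Log a one-line reason for each SERVFAIL answer.
    log-servfail: yes
    # Attach Extended DNS Error information (RFC 8914) to replies.
    ede: yes
```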
So other DNS clients work, and I wonder why, because unbound is not actually refusing the NXDOMAIN for the queried name. Something is happening with other names, and you surmise it is the overload protection mechanism that causes unbound to give up and not resolve any further.
My guess is, that there is some sort of issue around the antispam domains commercialised response method, and Unbound's protection mechanism. Somehow Unbound gets upset about the non-response that the email confidence provider has for its DNS servers, even though I guess you have access. It could be something with the nameservers, the DNS servers for the antispam domain name, and how they respond to the registered user; the commercialised access method could also make this different from an ordinary DNS resolution. But perhaps my guess is wrong, so it is fine to discuss further to investigate this issue.
There seems to be a misunderstanding.
Unfortunately there is.
The name SERVFAIL is from the RFC on the protocol, and registered, it is not made up.
According to RFC:
It doesn't mean a failure if a remote server returns many NXDOMAINs. As long as it functions according to the protocol, it is still alive.
I cannot understand where this conclusion comes from, that receiving many NXDOMAIN replies can be classified as SERVFAIL. NXDOMAIN is one of the standardized replies a remote NS server returns, and it is THEIR AUTHORITY to decide how to answer. As long as it works according to the protocol, it isn't SERVFAIL, regardless of how many NXDOMAINs it returns.
These then return nxdomains, and this is what the error is about, if that happens too many times, the algorithm gives up.
NXDOMAIN is not an error, it is an authoritative NS reply. "Too many NXDOMAINs" is not SERVFAIL; it is the remote server's reply, and it is the server's authority to decide, not a DNS client's place to conclude whether it is "too many" or not.
My guess is, that there is some sort of issue around the antispam domains commercialised response method
No, it isn't related. Other solutions like PowerDNS, KnotDNS, CoreDNS... work with the same providers without a problem. Only unbound interprets "many NXDOMAIN" as SERVFAIL and stops processing queries until it gets restarted.
Yes, it is fine for the upstream server to return NXDOMAIN for the target query.
The change from NXDOMAIN to SERVFAIL is for the lookup of the addresses of nameservers. This is where failed lookups make unbound give up and fail, and then return SERVFAIL. It is not NXDOMAIN for the target query from the end client, but NXDOMAIN for the addresses of the nameservers that unbound needs to send queries to.
I have a similar issue with a similar setup: a mail server with unbound, used for RBL checks.
My setup is a quite small server handling a few emails, and the issue of unbound replying SERVFAIL happens a few times per day.
Thanks to that small amount of traffic I was able to enable verbose logging and capture an example.
Environment: OpenBSD 7.5 with unbound 1.18.0 which is running with config:
server:
    interface: 127.0.0.1
    interface: ::1
    do-ip6: no
    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    access-control: ::0/0 refuse
    access-control: ::1 allow
    hide-identity: yes
    hide-version: yes
    auto-trust-anchor-file: "/var/unbound/db/root.key"
    val-log-level: 2
    verbosity: 5
    log-servfail: yes
    ede: yes
    aggressive-nsec: yes
    domain-insecure: "local."
    private-domain: "local."
    do-not-query-localhost: no
remote-control:
    control-enable: yes
    control-interface: /var/run/unbound.sock
stub-zone:
    name: "bl.local."
    stub-addr: 127.0.0.2
stub-zone:
    name: "wl.local."
    stub-addr: 127.0.0.2
I use a DNSBL filter for OpenSMTPD which uses gethostbyname_async to run multiple queries in parallel against unbound.
After enabling verbose logs in unbound, I discovered the following logs from OpenSMTPD:
Apr 20 18:37:01 mx1 smtpd[3772]: 249926c85ae61fc4 smtp connected address=194.48.251.196 host=<unknown>
Apr 20 18:38:16 mx1 smtpd[6188]: dnsbl: 249926c85ae61fc4 DNS error 2 on dnsbl.sorbs.net
Apr 20 18:38:16 mx1 smtpd[3772]: 249926c85ae61fc4 smtp disconnected reason=disconnect
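For context on the names that show up later in the thread: a DNSBL lookup builds its query name by reversing the client IP's octets and prepending them to the list's zone, so the connecting address 194.48.251.196 above is queried as 196.251.48.194.dnsbl.sorbs.net. A minimal sketch of that convention (the function name is mine, not from any of the software discussed):

```python
# Build a DNSBL query name: reverse the IPv4 octets and append the zone.
# A listed IP resolves (conventionally to a 127.0.0.x code); an unlisted
# IP yields NXDOMAIN, which is the normal "not listed" answer.

def dnsbl_qname(ip: str, zone: str) -> str:
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

print(dnsbl_qname("194.48.251.196", "dnsbl.sorbs.net"))
# 196.251.48.194.dnsbl.sorbs.net
```

This is why NXDOMAIN is the expected, frequent answer from DNSBL zones: most client IPs are simply not listed.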
Here, someone (a spammer?) connected to smtpd, which triggers a series of DNS requests, and after 1 minute (I guess the default timeout for gethostbyname_async) the error / response 2 happened. Let me quote the possible responses:
#define NETDB_INTERNAL -1 /* see errno */
#define NETDB_SUCCESS 0 /* no problem */
#define HOST_NOT_FOUND 1 /* Authoritative Answer Host not found */
#define TRY_AGAIN 2 /* Non-Authoritative Host not found, or SERVERFAIL */
#define NO_RECOVERY 3 /* Non recoverable errors, FORMERR, REFUSED, NOTIMP */
#define NO_DATA 4 /* Valid name, no data record of requested type */
#define NO_ADDRESS NO_DATA /* no address */
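A tiny lookup table over the netdb.h codes quoted above makes the smtpd log line readable; "DNS error 2" is TRY_AGAIN, the non-authoritative-failure / SERVFAIL case, not an authoritative "host not found":

```python
# h_errno values from <netdb.h>, as quoted above.
H_ERRNO = {
    -1: "NETDB_INTERNAL",   # see errno
    0:  "NETDB_SUCCESS",    # no problem
    1:  "HOST_NOT_FOUND",   # authoritative answer: host not found
    2:  "TRY_AGAIN",        # non-authoritative not found, or SERVFAIL
    3:  "NO_RECOVERY",      # FORMERR, REFUSED, NOTIMP
    4:  "NO_DATA",          # valid name, no record of requested type
}

print(H_ERRNO[2])  # TRY_AGAIN: the case seen as "DNS error 2" in the smtpd logs
```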
Ok, now unbound. I've extracted all its logs from 18:37:00 until 18:38:59, which include only the DNS requests associated with that client. As I said, I have little traffic. This log file is redacted: I replaced the API key for one list, but that is the only change I made. unbound.log
I tried to understand it, but I do not understand why it continues to repeat the same request over and over.
I think I got it.
NXDOMAIN is used by the cache layer for the case when the cache hasn't got the entry. So, when the DNSBL returns NXDOMAIN... unbound doesn't cache it, because to unbound it means "not in cache", and it tries another NS server for that zone, hoping that one responds with something.
@wcawijngaards is it possible to treat NXDOMAIN from upstream as a final decision and not ask the next NS server?
It does treat nxdomain from upstream as a final answer and then returns that to the client. Not sure what is going wrong for you. But just for getting an nxdomain, unbound should not be asking the next server.
@wcawijngaards see attached log where unbound repeats the same request over and over.
The logs do not have SERVFAIL in them. There is nxdomain, and I see why it makes a second query when there is an nxdomain: it is performing qname minimisation. Some servers wrongly report NXDOMAIN, so it checks whether the full query name is also nxdomain. Perhaps qname-minimisation: no helps out. Which request is repeated?
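Qname minimisation, in the spirit of RFC 9156, asks for incrementally longer suffixes of the name instead of the full name at once, which is why several queries for what looks like "the same name" can appear. A rough sketch of the query sequence (real resolvers vary the query types and may skip levels):

```python
# Sketch of qname minimisation: build the suffix sequence a resolver might
# query, from the TLD toward the full name.

def minimised_queries(qname: str):
    labels = qname.rstrip(".").split(".")
    # Suffixes from shortest (TLD) to longest (the full query name).
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]

for q in minimised_queries("196.251.48.194.dnsbl.sorbs.net."):
    print(q)
# net.
# sorbs.net.
# dnsbl.sorbs.net.
# 194.dnsbl.sorbs.net.
# 48.194.dnsbl.sorbs.net.
# 251.48.194.dnsbl.sorbs.net.
# 196.251.48.194.dnsbl.sorbs.net.
```

For a synthetic zone like a DNSBL, the intermediate labels (48.194.dnsbl.sorbs.net. and so on) are names the zone operator never intended to serve, which is exactly where buggy NXDOMAIN answers can surface.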
It repeats 196.251.48.194.dnsbl.sorbs.net. I'll put qname-minimisation: no on that server. If the issue appears again, I will re-collect a verbose log.
Can you set verbosity: 4? A lot of information was missing. Or use a logfile instead of syslog; if syslog at high verbosity produced the previous log, it may have dropped a lot of output. Otherwise, it looks like sorbs.net is just not responding to queries; unbound tries several different servers, at last one responds with NXDOMAIN, and that is the answer unbound uses, which is in the log.
I have the same behaviour with different DNSBL servers; here are statistics for the last 24h:
2 all.spamrats.com
36 b.barracudacentral.org
23 bl.spamcop.net
2 combined.mail.abusix.zone
8 dnsbl.sorbs.net
2 list.dnswl.org
These numbers are from my small mail server with quite little traffic.
But I assume that all of them use the same software to run their DNS: https://rbldnsd.io/
Are those servfails of some sort? Unbound has the option log-servfail: yes that can print one-line error responses, which you can inspect to see what is going on.
qname-minimisation: no doesn't help. The settings look like:
val-log-level: 2
# verbosity: 5
log-servfail: yes
ede: yes
qname-minimisation: no
and I just got this in the smtpd logs:
Apr 25 16:34:08 mx2 smtpd[21876]: dnsbl: 9618a2e89fa3d52f DNS error 2 on b.barracudacentral.org
that means that gethostbyname returns:
#define TRY_AGAIN 2 /* Non-Authoritative Host not found, or SERVERFAIL */
in parallel unbound log contains:
Apr 25 16:34:08 mx2 unbound: [59688:0] error: SERVFAIL <99.205.141.95.b.barracudacentral.org. A IN>: all servers for this domain failed, at zone b.barracudacentral.org. no server to query nameserver addresses not usable
As you asked, I switched to verbosity: 4.
But I would give disabling qname minimisation a fair chance to fix it; since the server generates its data, the intermediate labels of the lookup names could be trouble, which is typically what qname minimisation runs into, and that may cause surprising behaviour.
I wonder what causes the nameservers to be not usable for the query; the verbosity: 4 logs can maybe show what the responses look like, e.g. why a returned response is not usable. Verbosity level 5 is good too; 4 or higher.
Verbosity 5 was attached; now I'm waiting until it happens again to capture with verbosity 4.
Wait, that shortened log was verbosity 5? If so, could you use the log-to-file function? It seems that the syslog that may be in use drops a lot of stuff, like the debug category or large volumes of output.
Can you share what I should put into the config to do that?
With something like logfile: "/root/unbound.log" it logs to a file instead of syslog, and things like debug output or large volumes of output do not get dropped.
@wcawijngaards switched to:
val-log-level: 2
verbosity: 5
log-servfail: yes
ede: yes
qname-minimisation: no
logfile: "/var/unbound/db/unbound.log"
and I've got it. The mail server logs contain:
Apr 25 18:08:57 mx1 smtpd[71942]: 9f3ec9c731c1ab0e smtp connected address=89.113.156.156 host=<unknown>
Apr 25 18:10:12 mx1 smtpd[98148]: dnsbl: 9f3ec9c731c1ab0e DNS error 2 on b.barracudacentral.org
Apr 25 18:10:12 mx1 smtpd[71942]: 9f3ec9c731c1ab0e smtp disconnected reason=disconnect
So, here is the log between 1714061200 and 1714061500: unbound-range.log.gz
The log contains UDP timeouts. What puzzles me is that I see it on both servers at random times, but only for a few zones, dozens of times.
Also, one server is in Germany, the second one in Finland.
Something very strange here.
Ok, I've run tcpdump on one of the servers for about 30 minutes, and it proved that almost 10% of queries get no reply.
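The kind of accounting done on that capture can be sketched by matching DNS message IDs of outgoing queries against incoming responses and computing the unanswered fraction. The (id, qname) tuples below are made up; in practice they would be parsed from the pcap:

```python
# Toy capture: (dns_id, qname) tuples for queries sent and responses seen.
# In a real check these would be extracted from the tcpdump/pcap output.
queries = [
    (0x1A2B, "a.example"),
    (0x3C4D, "b.example"),
    (0x5E6F, "c.example"),
    (0x7A8B, "d.example"),
]
responses = [
    (0x1A2B, "a.example"),
    (0x5E6F, "c.example"),
    (0x7A8B, "d.example"),
]

answered = set(responses)
lost = [q for q in queries if q not in answered]
loss_rate = len(lost) / len(queries)
print(f"{loss_rate:.0%} of queries got no reply")  # 25% in this toy capture
```

A sustained loss rate like the observed ~10% is enough to make unbound's timeout-based infra tracking mark servers as bad from time to time.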
How can I enforce TCP connection to upstream servers?
Well... tcp-upstream: yes doesn't help, because the DNSBL nameservers are UDP only.
Huh.
The problem is that unbound, after some hardcoded number of attempts, marks such remote servers as SERVFAIL. But RBL servers can be silent for any number of reasons: they are currently being abused or DDoSed, they hit resource limits due to the huge amount of requests they handle by nature, the requester reached free/paid limits (per hour, per day), or something else in the middle.
Other DNS clients continue to query regardless of a temporary outage, but unbound marks such servers as SERVFAIL until it gets restarted.
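For what it's worth, the back-off is normally time-limited rather than permanent: timeout data lives in unbound's infra cache for infra-host-ttl seconds. A hedged config sketch that may soften the behaviour for temporarily silent servers; the option names are from unbound.conf(5), but whether they fully avoid the SERVFAILs depends on your unbound version (infra-keep-probing needs a reasonably recent one):

```
server:
    # Keep sending occasional probe queries to servers marked as down,
    # instead of only waiting for the infra cache entry to expire.
    infra-keep-probing: yes
    # How long (seconds) timeout/lameness data about a server is kept;
    # 900 is the documented default, shown here for visibility.
    infra-host-ttl: 900
```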
In my unbound server, I found a run log like this. (I use a.b.example.com to replace the true domain.) The questions are: a.b.example.com cannot find the answer, and xx.example.com cannot find the answer either. Why?