Closed otlich closed 2 months ago
The recursor will chain equivalent requests, I do not think clashing IDs are the issue. Chaining means that if a query for a name/type is already sent to a specific IP, no new query will be sent out, and the new client query will just wait on the answer of the earlier query coming in.
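To make the chaining idea concrete, here is a minimal Python sketch of the concept (not the recursor's actual code; send_to_auth is a made-up stand-in for the outgoing query): only one query per (qname, qtype, NS IP) goes out, later identical queries just await the same result, and an error on that one query reaches every chained waiter.

import asyncio

# Only one outgoing query per (qname, qtype, ns_ip) is in flight at a time;
# identical queries arriving later "chain" onto it and wait for its result.
in_flight: dict[tuple[str, str, str], asyncio.Future] = {}

async def send_to_auth(qname: str, qtype: str, ns_ip: str) -> str:
    await asyncio.sleep(0.05)          # stand-in for a slow authoritative server/backend
    return f"{qname} {qtype} answer from {ns_ip}"

async def resolve(qname: str, qtype: str, ns_ip: str) -> str:
    key = (qname, qtype, ns_ip)
    if key in in_flight:               # an equivalent query is already out
        return await in_flight[key]    # chain: wait for the earlier answer
    fut = asyncio.get_running_loop().create_future()
    in_flight[key] = fut
    try:
        answer = await send_to_auth(qname, qtype, ns_ip)
        fut.set_result(answer)         # wakes every chained waiter
        return answer
    except Exception as exc:
        fut.set_exception(exc)         # an error reaching the NS hits all chained clients
        raise
    finally:
        del in_flight[key]

async def main():
    answers = await asyncio.gather(*(resolve("example.com", "A", "127.0.0.1")
                                     for _ in range(100)))
    print(len(answers), "answers, produced by a single outgoing query")

asyncio.run(main())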
unreachable indicates that there is a network issue on the outgoing side. Are all NSes reachable from the recursor? It could be that an error reaching an NS is propagated to all chained clients. Also note that 192.168.0.10 is an IP normally in dont-query, so the recursor will refuse to query it.
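For reference, 192.168.0.10 falls inside the private address space that the default dont-query list covers. A quick Python check against such ranges (the list below is just the common RFC 1918 ranges, not the exact default dont-query value):

from ipaddress import ip_address, ip_network

# Not the recursor's code: just a check that the address sits in a private range.
private_ranges = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

def is_filtered(ip: str) -> bool:
    addr = ip_address(ip)
    return any(addr in ip_network(net) for net in private_ranges)

print(is_filtered("192.168.0.10"))   # True -> refused by the default dont-query list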
Please share your configuration and traces.
Additionally, please also describe your test setup, as it might be that the test uses a lot of resources on the machine the recursor is running on. I'd like to rule that out.
My test configs
$ cat /etc/powerdns/recursor.conf
allow-from=127.0.0.1, 10.0.0.0/8
config-dir=/etc/powerdns
hint-file=/usr/share/dns/root.hints
include-dir=/etc/powerdns/recursor.d
local-address=0.0.0.0
local-port=53
quiet=yes
setgid=pdns
setuid=pdns
dnssec=process-no-validate
extended-resolution-errors=yes
version-string=
max-negative-ttl=5
max-cache-ttl=600
packetcache-servfail-ttl=5
log-common-errors=yes
max-tcp-clients=512
packetcache-ttl=300
reuseport=yes
pdns-distributes-queries=yes
threads=20
forward-zones=example.com=127.0.0.1:5353
$ cat /etc/powerdns/pdns.conf
local-port=5353
distributor-threads=1
consistent-backends=no
disable-axfr=yes
cache-ttl=0
query-cache-ttl=0
negquery-cache-ttl=0
zone-cache-refresh-interval=0
launch=pipe
pipe-command=/usr/local/bin/dns_pipe.py
pipe-timeout=3000
pipe-abi-version=2
loglevel=9
log-dns-queries=yes
$ cat /usr/local/bin/dns_pipe.py
#!/usr/bin/python3
# Minimal pipe backend (ABI version 2): answers every query with an SOA record
# after a 50 ms delay, to simulate a slow backend.
import sys
import time

def resp(data):
    sys.stdout.write(data + '\n')
    sys.stdout.flush()

while True:
    line = sys.stdin.readline().strip()
    if line.startswith('HELO'):
        resp('OK\tSlow backend')
    if line.startswith('Q'):
        time.sleep(0.05)
        # split the tab-separated Q line; field 4 is the id echoed back in the DATA line
        f = line.replace('\t', ' ').strip().split(' ')
        qid = f[4]
        d = f'DATA\texample.com\tIN\tSOA\t1\t{qid}\texample.com. mail.example.com. {int(time.time())} 14400 3600 604800 10800'
        resp(d)
        resp('END')
Test process
# rec_control trace-regex 'example\.com' /tmp/pdns_trace
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
# flame 127.0.0.1 -p 53 -r example.com -g static -c 1 -Q 1000 -l 10 -T A
binding traffic generators to 0.0.0.0
flaming target(s) [127.0.0.1] on port 53 with 1 concurrent generators, each sending 10 queries every 1ms on protocol udp
query generator [static] contains 1 record(s)
rate limit @ 1000 QPS (1000 QPS per concurrent sender)
0.993019s: send: 993, avg send: 993, recv: 991, avg recv: 991, min/avg/max resp: 0.024002/11.0094/106.433ms, in flight: 1, timeouts: 0
1.99392s: send: 999, avg send: 996, recv: 1001, avg recv: 996, min/avg/max resp: 0.02282/3.49844/103.657ms, in flight: 1, timeouts: 0
2.99342s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.023038/5.52326/106.971ms, in flight: 1, timeouts: 0
3.99365s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.023439/5.56908/106.923ms, in flight: 1, timeouts: 0
4.99321s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.022015/5.50145/106.756ms, in flight: 1, timeouts: 0
5.99385s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.024154/5.53143/106.916ms, in flight: 1, timeouts: 0
6.99366s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.02218/5.33692/105.85ms, in flight: 1, timeouts: 0
7.99399s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.021278/5.44307/105.78ms, in flight: 1, timeouts: 0
8.99413s: send: 1001, avg send: 997, recv: 1001, avg recv: 997, min/avg/max resp: 0.022024/2.91637/103.43ms, in flight: 1, timeouts: 0
9.99354s: send: 999, avg send: 997, recv: 999, avg recv: 997, min/avg/max resp: 0.023732/5.53983/106.388ms, in flight: 1, timeouts: 0
------
run id : 7ffd638a73c0
run start : 2024-09-04T19:27:14Z
runtime : 9.99465 s
total sent : 9992
total rcvd : 9992
min resp : 0.021278 ms
avg resp : 5.58692 ms
max resp : 106.971 ms
avg r qps : 997
avg s qps : 997
avg pkt : 40 bytes
tcp conn. : 0
timeouts : 0 (0%)
bad recv : 0
net errors : 0
responses :
SERVFAIL: 155
NOERROR: 9837
# grep -i error /tmp/pdns_trace | grep -vi -e 'noerror' -e 'No error'
81 example.com: Error resolving from 127.0.0.1, possible error: Success
93 example.com: Error resolving from 127.0.0.1, possible error: Success
85 example.com: Error resolving from 127.0.0.1, possible error: Success
95 example.com: Error resolving from 127.0.0.1, possible error: Success
trace text
us === START OF TRACE 2024-09-04T22:27:16.042 ===
29 QM example.com: doResolve
30 example.com: Wants DNSSEC processing, auth data required by query for A
32 QM example.com: Step0 Not cached
33 QM example.com: Step0 qname is in a forwarded domain example.com
33 QM example.com: Step2 New child example.com
34 QM example.com: Step3 Going to do final resolve
34 example.com: Wants DNSSEC processing, auth data required by query for A
36 example.com: No cache hit for 'example.com|A', trying to find an appropriate NS record
37 example.com: Checking if we have NS in cache for 'example.com'
37 example.com: No valid/useful NS in cache for 'example.com'
38 example.com: Checking if we have NS in cache for 'com'
39 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'l.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
40 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'g.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
41 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'm.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
43 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'b.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
44 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'c.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
45 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'k.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
47 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'h.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
49 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'e.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
50 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'a.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
51 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'f.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
53 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'd.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
54 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'j.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
56 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'i.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
61 example.com: We have NS in cache for 'com' (flawedNSSet=0)
61 example.com authOrForwDomain: example.com nsFromCacheDomain: com isPartof: 1
62 example.com: Using forwarder as NS
63 example.com: Cache consultations done, have 1 NS to contact
65 example.com: Domain has hardcoded nameserver
67 example.com: Nameservers: -127.0.0.1:5353(81.50ms)
70 example.com: Resolved 'example.com' NS (empty) to: 127.0.0.1
71 example.com: Trying IP 127.0.0.1:5353, asking 'example.com|A'
81 example.com: Error resolving from 127.0.0.1, possible error: Success
84 example.com: Failed to resolve via any of the 1 offered NS at level 'example.com'
86 example.com: Failed (res=-1)
87 QM example.com: Step3 Final resolve: Server Failure/0
About the possible error value: it is different on different servers and sometimes changes.
Can you also provide the output of pdns_recursor --version, just to confirm the full version and build?
Thanks for the info. I'll try to reproduce, but that will be the day after tomorrow at the earliest.
other general comments
reuseport=yes
pdns-distributes-queries=yes
threads=20
The first two are conflicting ideas (distributing queries would help with a flame test from a single source, though). The second issue is that threads=20 is way too high: start with something more reasonable and then go up if you need to. For reference, you can do something like 160 kqps with only 8 threads pretty easily.
A thing to read: https://blog.powerdns.com/2024/06/05/powerdns-recursor-5-0-6-released
The advice there also holds for 4.9.x, and especially if you use a large value for threads.
If I disable pdns-distributes-queries and use reuseport, I won't have these errors, but I will have lots of outgoing requests. Another way to solve this problem is to enable serve-stale-extensions.
Yes, it looks like a network issue but it isn't.
# pdns_recursor --version
Sep 04 23:22:05 PowerDNS Recursor 4.9.6 (C) 2001-2022 PowerDNS.COM BV
Sep 04 23:22:05 Using 64-bits mode. Built using gcc 12.2.0 on May 13 2024 11:19:52 by root@localhost.
Sep 04 23:22:05 PowerDNS comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2.
Sep 04 23:22:05 Features: fcontext libcrypto-ecdsa libcrypto-ed25519 libcrypto-ed448 libcrypto-eddsa lua nod protobuf dnstap-framestream snmp sodium curl DoT scrypt
Sep 04 23:22:05 Configured with: " '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--sysconfdir=/etc/powerdns' '--enable-systemd' '--with-systemd=/lib/systemd/system' '--enable-unit-tests' '--disable-silent-rules' '--with-service-user=pdns' '--with-service-group=pdns' '--with-libcap' '--with-libsodium' '--with-lua' '--with-net-snmp' '--enable-dns-over-tls' '--enable-dnstap' '--enable-nod' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/pdns/pdns-recursor-4.9.6=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/pdns/pdns-recursor-4.9.6=. -fstack-protector-strong -Wformat -Werror=format-security'"
I think what is happening is this:

1. asyncresolve generates a new ID for the request: https://github.com/PowerDNS/pdns/blob/9e58e58718b1bfc61c8a47a9f676a02b8d7b39fd/pdns/recursordist/lwres.cc#L396
2. asyncresolve starts arecvfrom: https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/lwres.cc#L505
3. arecvfrom puts this ID into pident->id: https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/pdns_recursor.cc#L357
4. arecvfrom executes g_multiTasker->waitEvent with this pident: https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/pdns_recursor.cc#L363
5. waitEvent checks the uniqueness of the key (pident): https://github.com/PowerDNS/pdns/blob/9e58e58718b1bfc61c8a47a9f676a02b8d7b39fd/pdns/mtasker.cc#L175-L177

If different requests to the same NS and record have the same ID, arecvfrom returns LWResult::Result::PermanentError for one of them.
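If that analysis is right, the failure mode can be modeled with a toy Python version of the waiter bookkeeping (this is not the recursor's code, and the key layout only approximates PacketID): waiters are keyed by roughly (remote, qname, qtype, fd, qid), chained waiters all carry fd = -1, and a duplicate random 16-bit qid makes the insertion fail, which the caller turns into PermanentError.

import random

# Toy model: waitEvent refuses a waiter whose key already exists.
waiters: set[tuple[str, str, str, int, int]] = set()

def wait_event(ns_ip: str, qname: str, qtype: str, fd: int, qid: int) -> int:
    key = (ns_ip, qname, qtype, fd, qid)
    if key in waiters:
        return -1                    # duplicate key -> caller maps this to PermanentError
    waiters.add(key)
    return 1

random.seed(1)
results = [wait_event("127.0.0.1", "example.com", "A", -1, random.randrange(1 << 16))
           for _ in range(1000)]
print(results.count(-1), "of 1000 chained waiters clashed on a duplicate qid")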
Some more background (from knowledge of the internals of the recursor; I did not reproduce or confirm your analysis yet): with pdns-distributes-queries=yes the recursor uses a hash of the query name to select the thread. Since you are using a single query name, all requests end up being handled by the same thread.

With pdns-distributes-queries=no the network code of the OS will select a thread (though depending on the OS version, requests from a single IP and source port might still end up in the same thread). The chances that two identical requests end up in the same thread are lower with pdns-distributes-queries=no, though it can still happen. Initially all requests will find the cache empty, so they will initiate a request to the auth, and subsequent requests arriving at the same thread will chain. A bunch of simultaneous requests is sent to the auth, their number limited by the number of threads. After the auth responds to those queries, the chains will be emptied and new requests will find the answer in the cache.
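A tiny illustration of the distribution effect described above (Python's built-in hash stands in for whatever hash the distributor actually uses; the point is only that a single query name maps to a single thread):

threads = 20

def pick_thread(qname: str) -> int:
    # pdns-distributes-queries=yes style: thread chosen from a hash of the name
    return hash(qname.lower()) % threads

print({pick_thread("example.com") for _ in range(1000)})                  # one name -> one thread
print(len({pick_thread(f"host{i}.example.com") for i in range(1000)}))    # many names -> spread over threads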
I reproduced your test and can confirm that the duplicate PacketID is causing the ServFails. In my initial reaction I did not realize that the generated id is also used as (part of) the key for chained waiters, even though the id is not actually used for an outgoing request.

Chaining actually happens, but if an outgoing request is chained, it gets assigned an fd of -1. So all chained events share the same "fd", which increases the chance of a clash in the waiter code: name, type and IP are already the same, and only fd and qid can make a difference.
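To get a feel for how likely that clash is once many identical queries are chained (name, type, IP and fd = -1 are all equal, so only the random 16-bit qid keeps the keys distinct), a quick birthday-problem estimate in Python:

def clash_probability(waiters: int, id_space: int = 1 << 16) -> float:
    # probability that at least two of `waiters` random qids collide
    p_no_clash = 1.0
    for i in range(waiters):
        p_no_clash *= (id_space - i) / id_space
    return 1.0 - p_no_clash

for n in (10, 50, 100, 300, 1000):
    print(f"{n:5d} chained waiters -> clash probability {clash_probability(n):.3f}")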
Some thoughts about this:

I do wonder if this is a scenario we encounter in real life.

I'm pondering checking whether the chosen qid would cause a clash with an existing waiter, and respinning the random if so until we get a free qid. Of course this would fail if all ids are taken (i.e. the chain size reaches 64k), but the size of the chain is limited by the maximum number of mqueries anyway, so that would not really hurt.

Or maybe there's another key we could use if the event is a chained event, or maybe we should not assign -1 to all fds but different (negative) values? That would require remembering that fd in the chain.
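A toy sketch of the "respin until we get a free qid" idea (purely illustrative; not necessarily what the actual fix does):

import random

def pick_free_qid(existing_qids: set[int], id_space: int = 1 << 16) -> int:
    # keep drawing random ids until one does not clash with an existing waiter;
    # exhaustion cannot really happen since the chain length is bounded well below 64k
    if len(existing_qids) >= id_space:
        raise RuntimeError("all ids in use")
    while True:
        qid = random.randrange(id_space)
        if qid not in existing_qids:
            return qid

taken = {random.randrange(1 << 16) for _ in range(1000)}
print(pick_free_qid(taken) not in taken)   # True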
I do wonder if this is a scenario we encounter in real life.
Yes. I have many identical requests to Consul and I see these ServFails all the time. That's why I started looking into this issue.
This is fixed in the master branch. Will very likely be backported to future maintenance releases.
Short description
The recursor responds with ServFail if it receives many identical requests and the NS server responds slowly. In traces I see
Error resolving from 192.168.0.10, possible error: Resource temporarily unavailable
but in strace I do not see calls that would end with this error.
Environment
Steps to reproduce
Expected behaviour
All requests are waiting for a response from NS
Actual behaviour
The unreachables counter increases.
Other information
I think the source of the problem is that waitEvent returns -1, because asyncresolve sometimes generates the same ID for different requests.