Recursor responds servfail if it receives many identical requests and NS server responds slowly.

otlich commented 2 months ago

Program: Recursor,
Issue type: Bug report

Short description

Recursor responds servfail if it receives many identical requests and NS server responds slowly. In traces I see Error resolving from 192.168.0.10, possible error: Resource temporarily unavailable, but in strace I do not see calls that would end with this error.

Environment

Operating system: debian12
Software version: pdns-recursor 4.9.6

Steps to reproduce

Send 1000 requests per sec for one name and type to recursor
Use not fast NS that responds at least 100 milliseconds

Expected behaviour

All requests are waiting for a response from NS

Actual behaviour

One outgoing request is sent from recursor to NS
Some requests are awaiting a response from NS
Some requests end with servfail error
unreachables counter increases
In traces I see 'Error resolving from 192.168.0.10, possible error: Resource temporarily unavailable'
Servfail is saved in the cache
All next requests receive a servfail response from the cache
NS responds, cache is updated and all pending requests receive the correct response

Other information

I think the source of the problem it waitEvent return -1, because asyncresolve sometimes generates the same ID for different requests.

omoerbeek commented 2 months ago

The recursor will chain equivalent requests, I do not think clashing IDs are the issue. Chaining means that if a query for a name/type is already sent to a specific IP, no new query will be sent out, and the new client query will just wait on the answer of the earlier query coming in.

unreachable indicates that there is a network issue on the outgoing side. Are all NS reachable from the recursor? It could be that an error reaching an NS is propagated to all chained clients. Also note that 192.168.0.10 is an IP normally in dont-query, so the recursor wil refuse to query it.

Please share your configuration and traces.

omoerbeek commented 2 months ago

Additionally, please also describe your test setup, as it might be that the test uses many resources on the machine recursor is running on. I'd like to rule that out.

otlich commented 2 months ago

My test configs

$ cat /etc/powerdns/recursor.conf 
allow-from=127.0.0.1, 10.0.0.0/8
config-dir=/etc/powerdns
hint-file=/usr/share/dns/root.hints
include-dir=/etc/powerdns/recursor.d
local-address=0.0.0.0
local-port=53
quiet=yes
setgid=pdns
setuid=pdns
dnssec=process-no-validate
extended-resolution-errors=yes
version-string=
max-negative-ttl=5
max-cache-ttl=600
packetcache-servfail-ttl=5
log-common-errors=yes
max-tcp-clients=512
packetcache-ttl=300
reuseport=yes
pdns-distributes-queries=yes
threads=20
forward-zones=example.com=127.0.0.1:5353

$ cat /etc/powerdns/pdns.conf 
local-port=5353
distributor-threads=1
consistent-backends=no
disable-axfr=yes
cache-ttl=0
query-cache-ttl=0
negquery-cache-ttl=0
zone-cache-refresh-interval=0
launch=pipe
pipe-command=/usr/local/bin/dns_pipe.py
pipe-timeout=3000
pipe-abi-version=2
loglevel=9
log-dns-queries=yes

$ cat /usr/local/bin/dns_pipe.py
#!/usr/bin/python3
import sys
import time

def resp(data):
    sys.stdout.write(data+'\n')
    sys.stdout.flush()

while True:
    line = sys.stdin.readline().strip()
    if line.startswith('HELO'):
        resp('OK\tSlow backend')
    if line.startswith('Q'):
        time.sleep(0.05)
        f = line.replace('\t',' ').replace('  ',' ').strip().split(' ')
        qid = f[4]
        d = f'DATA\texample.com\tIN\tSOA\t1\t{qid}\texample.com. mail.example.com. {int(time.time())} 14400 3600 604800 10800'
        resp(d)
        resp('END')

Test process

# rec_control trace-regex 'example\.com' /tmp/pdns_trace
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok

# flame 127.0.0.1 -p 53 -r example.com -g static -c 1 -Q 1000 -l 10 -T A
binding traffic generators to 0.0.0.0
flaming target(s) [127.0.0.1] on port 53 with 1 concurrent generators, each sending 10 queries every 1ms on protocol udp
query generator [static] contains 1 record(s)
rate limit @ 1000 QPS (1000 QPS per concurrent sender)
0.993019s: send: 993, avg send: 993, recv: 991, avg recv: 991, min/avg/max resp: 0.024002/11.0094/106.433ms, in flight: 1, timeouts: 0
1.99392s: send: 999, avg send: 996, recv: 1001, avg recv: 996, min/avg/max resp: 0.02282/3.49844/103.657ms, in flight: 1, timeouts: 0
2.99342s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.023038/5.52326/106.971ms, in flight: 1, timeouts: 0
3.99365s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.023439/5.56908/106.923ms, in flight: 1, timeouts: 0
4.99321s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.022015/5.50145/106.756ms, in flight: 1, timeouts: 0
5.99385s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.024154/5.53143/106.916ms, in flight: 1, timeouts: 0
6.99366s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.02218/5.33692/105.85ms, in flight: 1, timeouts: 0
7.99399s: send: 1000, avg send: 997, recv: 1000, avg recv: 997, min/avg/max resp: 0.021278/5.44307/105.78ms, in flight: 1, timeouts: 0
8.99413s: send: 1001, avg send: 997, recv: 1001, avg recv: 997, min/avg/max resp: 0.022024/2.91637/103.43ms, in flight: 1, timeouts: 0
9.99354s: send: 999, avg send: 997, recv: 999, avg recv: 997, min/avg/max resp: 0.023732/5.53983/106.388ms, in flight: 1, timeouts: 0

------
run id      : 7ffd638a73c0
run start   : 2024-09-04T19:27:14Z
runtime     : 9.99465 s
total sent  : 9992
total rcvd  : 9992
min resp    : 0.021278 ms
avg resp    : 5.58692 ms
max resp    : 106.971 ms
avg r qps   : 997
avg s qps   : 997
avg pkt     : 40 bytes
tcp conn.   : 0
timeouts    : 0 (0%) 
bad recv    : 0
net errors  : 0
responses   :
  SERVFAIL: 155
  NOERROR: 9837

# grep -i error /tmp/pdns_trace | grep -vi -e 'noerror' -e 'No error'
81 example.com: Error resolving from 127.0.0.1, possible error: Success
93 example.com: Error resolving from 127.0.0.1, possible error: Success
85 example.com: Error resolving from 127.0.0.1, possible error: Success
95 example.com: Error resolving from 127.0.0.1, possible error: Success

trace text

 us === START OF TRACE 2024-09-04T22:27:16.042 ===
29 QM example.com: doResolve
30 example.com: Wants DNSSEC processing, auth data required by query for A
32 QM example.com: Step0 Not cached
33 QM example.com: Step0 qname is in a forwarded domain example.com
33 QM example.com: Step2 New child example.com
34 QM example.com: Step3 Going to do final resolve
34 example.com: Wants DNSSEC processing, auth data required by query for A
36 example.com: No cache hit for 'example.com|A', trying to find an appropriate NS record
37 example.com: Checking if we have NS in cache for 'example.com'
37 example.com: No valid/useful NS in cache for 'example.com'
38 example.com: Checking if we have NS in cache for 'com'
39 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'l.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
40 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'g.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
41 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'm.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
43 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'b.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
44 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'c.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
45 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'k.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
47 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'h.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
49 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'e.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
50 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'a.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
51 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'f.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
53 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'd.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
54 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'j.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
56 example.com: NS (with ip, or non-glue) in cache for 'com' -> 'i.gtld-servers.net', within bailiwick: 0, not in cache / did not look at cache
61 example.com: We have NS in cache for 'com' (flawedNSSet=0)
61 example.com authOrForwDomain: example.com nsFromCacheDomain: com isPartof: 1
62 example.com: Using forwarder as NS
63 example.com: Cache consultations done, have 1 NS to contact
65 example.com: Domain has hardcoded nameserver
67 example.com: Nameservers: -127.0.0.1:5353(81.50ms)
70 example.com: Resolved 'example.com' NS (empty) to: 127.0.0.1
71 example.com: Trying IP 127.0.0.1:5353, asking 'example.com|A'
81 example.com: Error resolving from 127.0.0.1, possible error: Success
84 example.com: Failed to resolve via any of the 1 offered NS at level 'example.com'
86 example.com: Failed (res=-1)
87 QM example.com: Step3 Final resolve: Server Failure/0

About possible error, it is different on different servers and sometimes changes.

phonedph1 commented 2 months ago

Can you also provide pdns_recursor --version just to confirm the full version and build

omoerbeek commented 2 months ago

Thanks for the info. I'll try to reproduce, but that will be the day after tomorrow at the earliest.

phonedph1 commented 2 months ago

other general comments

reuseport=yes
pdns-distributes-queries=yes
threads=20

the first 2 are conflicting ideas (distribute queries would help with flame test from a single source though). the second is threads=20 is way too high. start with something more reasonable and then go up if you need to. for reference you can do like 160kqps with only 8 pretty easily.

omoerbeek commented 2 months ago

A thing to read: https://blog.powerdns.com/2024/06/05/powerdns-recursor-5-0-6-released The advice there also holds for 4.9.x and especially if you use a large value for threads.

otlich commented 2 months ago

If I disable pdns-distributes-queries and use reuseport I won't have these errors but will have lots of outgoing requests. Another way to solve this problem is to enable serve-stale-extensions.

Yes, it looks like a network issue but it isn't.

# pdns_recursor --version
Sep 04 23:22:05 PowerDNS Recursor 4.9.6 (C) 2001-2022 PowerDNS.COM BV
Sep 04 23:22:05 Using 64-bits mode. Built using gcc 12.2.0 on May 13 2024 11:19:52 by root@localhost.
Sep 04 23:22:05 PowerDNS comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2.
Sep 04 23:22:05 Features: fcontext libcrypto-ecdsa libcrypto-ed25519 libcrypto-ed448 libcrypto-eddsa lua nod protobuf dnstap-framestream snmp sodium curl DoT scrypt 
Sep 04 23:22:05 Configured with: " '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--sysconfdir=/etc/powerdns' '--enable-systemd' '--with-systemd=/lib/systemd/system' '--enable-unit-tests' '--disable-silent-rules' '--with-service-user=pdns' '--with-service-group=pdns' '--with-libcap' '--with-libsodium' '--with-lua' '--with-net-snmp' '--enable-dns-over-tls' '--enable-dnstap' '--enable-nod' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/pdns/pdns-recursor-4.9.6=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/pdns/pdns-recursor-4.9.6=. -fstack-protector-strong -Wformat -Werror=format-security'"

otlich commented 2 months ago

I think happening is this:

asyncresolve gen new ID for request https://github.com/PowerDNS/pdns/blob/9e58e58718b1bfc61c8a47a9f676a02b8d7b39fd/pdns/recursordist/lwres.cc#L396
(does something)
asyncresolve start arecvfrom https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/lwres.cc#L505
arecvfrom add this ID in to pident->id https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/pdns_recursor.cc#L357
arecvfrom execute g_multiTasker->waitEvent with this pident https://github.com/PowerDNS/pdns/blob/3264365937199e20ad4edec245a79501e763562e/pdns/recursordist/pdns_recursor.cc#L363
waitEvent check uniq key (pident) https://github.com/PowerDNS/pdns/blob/9e58e58718b1bfc61c8a47a9f676a02b8d7b39fd/pdns/mtasker.cc#L175-L177

If different requests to the same NS and record have the same ID, arecvfrom return LWResult::Result::PermanentError for one of them.

omoerbeek commented 2 months ago

Some more background (from knowledge of the internals of the recursor, I did not reproduce or confirm you analysis yet) with pdns-distributes-queries=yes the Recursor uses a hash of the query name to select the thread. Since you are using a single query name, all request end up being handled by the same thread.

Withe pdns-distributes-queries=no the network code of the OS will select a thread (though depending on the OS version, request form a single IP and src port might still end up in the same thread). The chances are lower with pdns-distributes-queries=no that two identical requests end up in the same thread, though it still can happen. Initially all requests will find the cache empty, so they will initiate a request to the auth and subsequent requests to the same thread will chain. A bunch of simultaneous requests are sent to the auth, the number of them limited to the amount of threads. After the auth responds to those queries, the chains will be emptied and new request will find the answer in the cache.

omoerbeek commented 2 months ago

I reproduced your test and can confirm that the duplicate PacketID is causing the ServFails. In my initial reaction I did not realize that the generated id is also used as (part of) the key for chained waiters, even though the id is not actually used for an outgoing request.

Chaining actually happens, but if a outgoing request is chained, it gets a assigned an fd of -1. So, all chained event share the same "fd", which increase the chances of a clash in the waiter code: name, type, IP are already the same, and only fd and qid can make a difference.

Some thoughts about this:

I do wonder if this is a scenario we encounter in real life.
I'm pondering checking if the chosen qid would case a clash with an existing waiter and respin the random if so until we get a free qid. Of course this will fail of all ids are taken (size of chain is 64k). But the size of the chain is limited by the maximum number of mqueries anyway, so that would not really hurt.
Or maybe there's another key we could use if the event is a chained event/maybe not assign -1 too all fd, but different (negative) values? The would require remembering that fd in the chain.

otlich commented 2 months ago

I do wonder if this is a scenario we encounter in real life.

Yes. I have many same request to consul and I see these servfail all the time. That's why I started looking into this issue.

omoerbeek commented 2 months ago

This is fixed in the master branch. Will very likely be backported to future maintenance releases.

PowerDNS / pdns