PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Misleading timeout logs for recursor with high load and small number of threads #9208

Closed vin01 closed 3 years ago

vin01 commented 4 years ago

Short description

Recently I used trace queries to debug some ServFail responses under medium load (300-400 queries per second) and saw the following failure logs:

timeout resolving after 1515.71msec
Trying IP 127.0.0.1:5300, asking 'example.intranet.domain:AAAA'
...

The network timeout is set to the default 1500 msec and the pdns auth server is running locally on the same node. The log seemed to indicate some issue with the pdns auth server, which did not seem to respond within the expected 1500 msec.

However, the issue turned out to be the recursor itself running with a small number of threads: raising the thread count to 400 with threads=400 got rid of these timeouts and hence the ServFails. Before this, the recursor was running with the default thread count of 2 for my 2-core machine.
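
For reference, a minimal sketch of the two recursor.conf settings involved here (1500 msec is the default network timeout; 400 is the value I ended up with):

# recursor.conf (sketch)
# timeout for outgoing queries, in milliseconds (1500 is the default)
network-timeout=1500
# raised from the default of 2 on this 2-core machine
threads=400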

I also tried setting distributor-threads and receiver-threads to higher values, which did not help much. The auth server also indicated no issues in its logs, so I believe this was purely a recursor issue. The performance guide also recommends setting threads to the number of cores, which did not work well for me with a high number of queries.

Environment

Ubuntu 18.04

pdns-backend-bind/unknown,now 4.3.0-1pdns.bionic amd64 [installed,automatic]
pdns-backend-pgsql/unknown,now 4.3.0-1pdns.bionic amd64 [installed]
pdns-recursor/unknown,now 4.3.1-1pdns.bionic amd64 [installed]
pdns-server/unknown,now 4.3.0-1pdns.bionic amd64 [installed]

Steps to reproduce

A high number of concurrent AAAA queries (400/s in my case).
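
A rough sketch of one way to generate such a load with dnsperf (the query file name and flag values here are illustrative; the exact command I used is shown further down in the thread). The query file contains one "name type" pair per line, e.g. host1.intranet.domain AAAA:

dnsperf -s 127.0.0.1 -p 53 -d queryfile.internal -Q 400 -c 100 -m udp -l 30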

Expected behaviour

A better log message, perhaps one indicating internal wait-time exhaustion, would be more helpful here than the generic timeout log.

Actual behaviour

A log indicating a timeout, which implies a timeout in getting a response from the auth server. Alternatively, the documentation could perhaps mention this behaviour explicitly to avoid confusion.

Other information

Looking at the code, the timeout log seems to be based on resolveret having the value 0 (https://github.com/PowerDNS/pdns/blob/ea07a19abfb8c92e23d292f3b8c30ff93facaf19/pdns/syncres.cc#L3238). However, that value comes from the return value of waitEvent in the MTasker, which I believe can also simply mean that the task waited for the event for that long; the reason can just as well be that there are too many events taking place at once and not enough threads to handle them. A value of 0 therefore does not necessarily indicate a network timeout towards the auth server, it can also mean a long internal wait time (https://github.com/PowerDNS/pdns/blob/master/pdns/mtasker.cc#L170).

PS: I discussed this on IRC as well but we did not reach any conclusion; I will be happy to discuss further there or on the mailing list if that is deemed more appropriate.

phonedph1 commented 4 years ago

Do you have actual captures that show queries not being sent, or not being replied to within the required timeout, to prove this theory?

A lot of what you've written doesn't make much sense given how the recursor operates. I can only guess that by setting such an excessively high thread count you have simply slowed your recursor down to the point where your auth setup is able to keep up. I suspect the problem is still on the other end.

vin01 commented 4 years ago

Statistics from the auth server API:

    {
        "name": "servfail-packets",
        "type": "StatisticItem",
        "value": "0"
    },

    {
        "name": "servfail-queries",
        "size": "10000",
        "type": "RingStatisticItem",
        "value": []
    },

At the same time, I see a lot of ServFail responses from the recursor to the queries. I can also validate the same via tcpdump. I believe this rules out the auth server being the issue here? I also see nothing in the pdns logs with the different loglevels that I have tried. Is there anything else that I should look into for the auth server? I am sure the statistics are working fine here, as I also see the expected NXDOMAIN results in them.
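
For reference, these numbers can be pulled from the auth server's REST API roughly like this (a sketch, assuming the default webserver port of 8081 and the configured API key):

curl -s -H 'X-API-Key: [REDACTED]' http://127.0.0.1:8081/api/v1/servers/localhost/statistics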

About slowing down the recursor with 400 threads: I looked at the metrics for the last 2 days, during which this has been running with 400 threads, and there is no increase, rather a decrease, in queries responded to after 1 second or within 100 ms. Most queries have been responded to within 10 ms.

dnsperf also does not show any significant change in performance, only an improvement in that the ServFails disappear.

Indeed it does not make sense, which is precisely why I initially did not just go ahead and increase the thread count, but instead tried things like increasing cache-ttl, increasing the TTL for the internal domains which fail, and raising the receiver-threads and distributor-threads counts for the auth server, none of which worked. The above was the only thing that helped. I am also keeping an eye on CPU usage for the node, and it has been quite low; now, with 400 threads for the recursor, I see slightly higher usage, which is expected, nothing concerning. System load is way below what the machine can handle.

omoerbeek commented 4 years ago

To make the report more useful I believe you should share the complete config of your recursor and auth setup that showed the issue. Please also share your test setup. Without those there is little of relevance we can say.

vin01 commented 4 years ago

pdns.conf and all files under pdns.d:

cache-ttl=60
include-dir=/etc/powerdns/pdns.d
launch=
log-dns-details=yes
log-dns-queries=yes
loglevel=5
receiver-threads=2
reuseport=yes
setgid=pdns
setuid=pdns
launch+=bind
bind-config=/etc/powerdns/named.conf
bind-supermaster-config=/var/lib/powerdns/supermaster.conf
bind-supermaster-destdir=/var/lib/powerdns/zones.slave.d
api=yes
api-key=[REDACTED]
local-ipv6=
local-address=127.0.0.1
local-port=5300
launch+=gpgsql
gpgsql-host=/var/run/postgresql
gpgsql-dbname=pdns
gpgsql-user=pdns
gpgsql-password=
gpgsql-dnssec=no

recursor.conf and files under recursor.d:

local-address=::,127.0.0.1
local-port=53
forward-zones=intranet.domain=127.0.0.1:5300
forward-zones-recurse=.=[REDACTED GOOGLE V6 and V4 IPs]
allow-from=::1,127.0.0.1,[INTRANET_V6_RANGE]
query-local-address6=::
webserver=yes
webserver-address=::
reuseport=yes
config-dir=/etc/powerdns
hint-file=/usr/share/dns/root.hints
include-dir=/etc/powerdns/recursor.d
quiet=yes
security-poll-suffix=
setgid=pdns
setuid=pdns
trace=fail

It is a single-node setup with both the recursor and the auth server running with the above config on a 2-core VM. System load is consistently around load average: 0.05, 0.03, 0.06. If you have any suggestions for different auth server tuning/tracing parameters, I can try those out and share the results.

I have also ruled out the PostgreSQL backend as the issue here by enabling logging for queries slower than 500 ms; that does not seem to be a bottleneck either.
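
For reference, a sketch of how that slow-query logging can be enabled in postgresql.conf (picked up after a reload of PostgreSQL):

# postgresql.conf: log any statement taking longer than 500 ms
log_min_duration_statement = 500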

vin01 commented 4 years ago

I wonder if this applies here and whether this blocking behaviour is what I am encountering:

This is both good and bad, but mostly good. It means PowerDNS does not have to think about locking. No two threads will ever be talking to the DNS cache at the same time, for example.

It also means that the recursor could block if any operation takes too long.

https://doc.powerdns.com/recursor/appendices/internals.html#synchronous-code-using-mtasker

rgacogne commented 4 years ago

Would you mind providing the exact dnsperf command you are using? You mentioned on IRC that it only happens with very specific queries and not with others, which seems highly relevant.

vin01 commented 4 years ago

dnsperf -s [REDACTED_V6_ADDR] -d queryfile.internal -Q 172 -T 69 -f inet6 -m udp -c 172 -e

I am being explicit in specifying the arguments. With TCP, things also work fine with a low number of threads and I don't experience failures.

omoerbeek commented 4 years ago

Please read https://blog.powerdns.com/2016/01/18/open-source-support-out-in-the-open/; we do not like redacted info in config files or logs. Also, please be more forthcoming with sharing information. Your case seems very specific, so we need all the details.

That said:

First of all disable quiet=yes. It might suppress some useful info.

Next: why are you forwarding everything? It could be that upstream is rate limiting you. It is not clear which queries fail, which is essential information.

The recursor itself is very well suited to doing the work for the non-intranet domains.

vin01 commented 4 years ago

I do understand that missing information will of course make it harder for you to help me here. But I don't think I have redacted any useful information, only the IP addresses, the API key and the domain names themselves, while explicitly indicating that the failing queries are for intranet domains which are supposed to be resolved locally and not forwarded anywhere. Most queries are responded to within 10 ms. Is there any particular redacted piece of information that you think would help? The config overall is pretty minimal.

With trace=fail enabled, won't every failure be logged anyway by the recursor? I also used trace-regex to log traces for specific domains. With quiet=no I only get additional info logs for queries, nothing useful in terms of indicating failures.
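
For reference, a sketch of how tracing can be limited to the failing domain at runtime (the regex here is illustrative):

# trace only queries whose name matches the regex
rec_control trace-regex '.*\.intranet\.domain\.$'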

omoerbeek commented 4 years ago

To me it wasn't clear whether the failures only occurred for internal queries, and the reported stats might give us some clues.

Anyway, can you try the following option:

dont-throttle-netmasks=127.0.0.1/32

It's a bit of a guess, but it might be that the auth server gets throttled for some reason. No clue yet why that would only happen with dnsperf -e though.

vin01 commented 4 years ago

That does not help, unfortunately. I have also been monitoring pdns_recursor_throttled_outqueries, pdns_recursor_throttled_out and pdns_recursor_throttle_entries, all of which are 0 while the failures occur.
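
For completeness, a sketch of reading the same counters, and the raw throttle map, directly from the recursor:

rec_control get throttled-out throttled-outqueries throttle-entries
# dumps the current throttle map to the given file (the path must be writable by the recursor user)
rec_control dump-throttlemap /tmp/throttlemap.txt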

omoerbeek commented 4 years ago

Looking at your config I see it only listens on 127.0.0.1 (there are two local-address lines in your recursor.conf, and the second takes precedence). I do not see how that can combine with your IPv6-only dnsperf run.

vin01 commented 4 years ago

Looking at your config I see it only listens on 127.0.0.1 (there are two local-address lines in your recursor.conf, and the second takes precedence). I do not see how that can combine with your IPv6-only dnsperf run.

Apologies, my mistake while assembling the config from different files under recursor.d/! I have just fixed the comment. It listens on both v6 and v4.

omoerbeek commented 4 years ago

I set up a test much like your setup, using a bind backend with this zone:

$ORIGIN example.net.
$TTL 86400
@ SOA dns1.example.net. hostmaster.example.net. (
  2001062501 ; serial
  21600      ; refresh after 6 hours
  3600       ; retry after 1 hour
  604800     ; expire after 1 week
  86400 )    ; minimum TTL of 1 day

        NS      dns1.example.net.

dns1    A       127.0.0.1

$GENERATE 1-100000      $       A       1.2.3.4

And a datafile called a which contains 10000 lines:

1.example.net A
...
10000.example.net A
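
One way to generate such a file (a sketch):

# write 10000 "name type" query lines
seq 1 10000 | awk '{ print $1 ".example.net A" }' > a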

A test run gets me:

$ dnsperf -s ::1 -l 60 -d a -T69 -f inet6 -m udp -c 172 -e 
DNS Performance Testing Tool
Version 2.3.4

[Status] Command line: dnsperf -s ::1 -l 60 -d a -T69 -f inet6 -m udp -c 172 -e
[Status] Sending queries (to ::1)
[Status] Started at: Mon Jun  8 13:00:41 2020
[Status] Stopping after 60.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         2802127
  Queries completed:    2802127 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 2802127 (100.00%)
  Average packet size:  request 44, response 60
  Run time (s):         60.018660
  Queries per second:   46687.596824

  Average Latency (s):  0.002124 (min 0.000021, max 0.022849)
  Latency StdDev (s):   0.001588

I also tried other values in the dnsperf call and a few variations of reuseport=yes/no and pdns-distributes-queries=yes/no. All without any ServFail or lost reply, just some (slight) variations in qps.

This is on a 4-core VM running Debian buster, with rec 4.3.1 and auth 4.3.0.

So I'm out of suggestions for the recursor atm. But your thread setting of 400 makes no sense at all...

I would suggest you try to simplify your setup to see where the pain is. Note that you can also use an auth zone so that you are not dependent on an external server for your internal zone. See https://docs.powerdns.com/recursor/settings.html#auth-zones
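
For example, a sketch of such an auth-zones entry (the zone file path here is hypothetical):

# recursor.conf: serve the internal zone directly instead of forwarding it
auth-zones=intranet.domain=/etc/powerdns/intranet.domain.zone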

vin01 commented 4 years ago

Thanks for doing the test setup, @omoerbeek. Apologies that it took me a while to get back to this.

I have run the test on a minimal setup like above and with no other services querying the recursor at the same time.

Most of the time dnsperf reports no ServFails, but once in a while, with multiple runs in quick succession, I am able to see some.

The following is the output from multiple runs within a few seconds:

[Status] Command line: dnsperf -p 53 -d queryfile.internal -Q 352 -T 100 -m udp -c 122 -e -l 5
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Tue Jul 21 08:24:05 2020
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1760
  Queries completed:    1760 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1694 (96.25%), NXDOMAIN 66 (3.75%)
  Average packet size:  request 49, response 69
  Run time (s):         5.002234
  Queries per second:   351.842797

  Average Latency (s):  0.001197 (min 0.000056, max 0.005685)
  Latency StdDev (s):   0.000948

[Status] Command line: dnsperf -p 53 -d queryfile.internal -Q 352 -T 100 -m udp -c 122 -e -l 5
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Tue Jul 21 08:24:13 2020
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1760
  Queries completed:    1760 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1694 (96.25%), NXDOMAIN 66 (3.75%)
  Average packet size:  request 49, response 69
  Run time (s):         5.003065
  Queries per second:   351.784356

  Average Latency (s):  0.001258 (min 0.000069, max 0.009366)
  Latency StdDev (s):   0.001194

[Status] Command line: dnsperf -p 53 -d queryfile.internal -Q 352 -T 100 -m udp -c 122 -e -l 5
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Tue Jul 21 08:24:20 2020
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1760
  Queries completed:    1760 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1694 (96.25%), NXDOMAIN 66 (3.75%)
  Average packet size:  request 49, response 69
  Run time (s):         5.002041
  Queries per second:   351.856372

  Average Latency (s):  0.001295 (min 0.000057, max 0.005323)
  Latency StdDev (s):   0.001043

[Status] Command line: dnsperf -p 53 -d queryfile.internal -Q 352 -T 100 -m udp -c 122 -e -l 5
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Tue Jul 21 08:24:27 2020
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1760
  Queries completed:    1760 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1691 (96.08%), SERVFAIL 3 (0.17%), NXDOMAIN 66 (3.75%)
  Average packet size:  request 49, response 69
  Run time (s):         5.002462
  Queries per second:   351.826761

  Average Latency (s):  0.002892 (min 0.000099, max 1.663595)
  Latency StdDev (s):   0.039673

NXDOMAINs are expected, as I put some bad domains in there just to test; they don't make any difference.

So maybe it is bursty traffic that can result in such behaviour? In my original setup where I noticed these failures, the DNS queries were being made by Prometheus, which is also bursty in nature.

Thanks!

omoerbeek commented 3 years ago

Very, very likely fixed by #9707; if not, please re-open.