PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Recursor: NXDOMAIN response for certain subdomains #10947

Closed bnkas closed 3 years ago

bnkas commented 3 years ago

Short description

When the PowerDNS Recursor starts with an empty cache, speedtest.xfinity.com always results in an NXDOMAIN response after a long query time (often over 1500 msec). This is the only name on which I can consistently reproduce the behavior.

Environment

Steps to reproduce

  1. Restart the recursor to remove cache: systemctl restart pdns-recursor.service
  2. Enable tracing for the domain: rec_control trace-regex '.*\.xfinity.com\.$'
  3. Dig for the domain: dig speedtest.xfinity.com

Here is the only config file (there are no scripts or such)

allow-from=127.0.0.0/8, 10.0.0.0/8
config-dir=/etc/powerdns
ecs-minimum-ttl-override=10
distributor-threads=1
dnssec=process
hint-file=/usr/share/dns/named.root
include-dir=/etc/powerdns/recursor.d
local-address=127.0.0.1, 176.1.1.1
local-port=53
lowercase-outgoing=yes
lua-config-file=/etc/powerdns/recursor.lua
max-cache-bogus-ttl=3600
max-cache-entries=10000000
max-cache-ttl=86400
max-concurrent-requests-per-tcp-connection=1000
max-mthreads=3584
max-negative-ttl=3600
max-packetcache-entries=5000000
minimum-ttl-override=600
nothing-below-nxdomain=dnssec
pdns-distributes-queries=yes
public-suffix-list-file=/usr/share/publicsuffix/public_suffix_list.dat
qname-minimization=yes
query-local-address=176.1.1.1
quiet=yes
refresh-on-ttl-perc=10
statistics-interval=600
setgid=pdns
setuid=pdns
threads=4

Expected behaviour

Resolving the subdomain to IPs

; <<>> DiG 9.16.22-Debian <<>> speedtest.xfinity.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12042
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;speedtest.xfinity.com.     IN  A

;; ANSWER SECTION:
speedtest.xfinity.com.  600 IN  CNAME   cdn.speedtest-prod.xcr.comcast.net.
cdn.speedtest-prod.xcr.comcast.net. 600 IN A    76.96.120.26
cdn.speedtest-prod.xcr.comcast.net. 600 IN A    96.96.229.208
cdn.speedtest-prod.xcr.comcast.net. 600 IN A    69.252.62.158
cdn.speedtest-prod.xcr.comcast.net. 600 IN A    96.96.229.240
cdn.speedtest-prod.xcr.comcast.net. 600 IN A    69.252.61.190

;; Query time: 2231 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 03 21:30:24 EDT 2021
;; MSG SIZE  rcvd: 178

Actual behaviour

NXDOMAIN response is received

; <<>> DiG 9.16.22-Debian <<>> speedtest.xfinity.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 47442
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;speedtest.xfinity.com.     IN  A

;; ANSWER SECTION:
speedtest.xfinity.com.  600 IN  CNAME   cdn.speedtest-prod.xcr.comcast.net.

;; AUTHORITY SECTION:
speedtest-prod.xcr.comcast.net. 86399 IN SOA    cdn-tr-ric-02.xcr.comcast.net. traffic_ops.speedtest-prod.xcr.comcast.net. 2021110318 28800 7200 604800 30

;; Query time: 2387 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 03 21:12:05 EDT 2021
;; MSG SIZE  rcvd: 160

I also attached the trace logs from the recursor as a text file: trace-log.txt

Other information

The TTL on the NXDOMAIN is 10 minutes. If I wait 10 minutes and run the exact same dig again, speedtest.xfinity.com resolves successfully to its IPs (the dig output for that is shown in the Expected behaviour section). That's why the repro steps say the cache should be cleared first. The presence of previously cached records lets it resolve correctly the second time, but the first query always gets an NXDOMAIN response. I also replicated the issue on three different PowerDNS recursors on completely different networks and countries.

mnordhoff commented 3 years ago

I don't have to wait 600 seconds; it resolves if I send a second query almost immediately.

setharnold commented 3 years ago

Hello, I'm curious about this:

allow-from=127.0.0.0/8, 10.0.0.0/8
local-address=127.0.0.1, 176.1.1.1
query-local-address=176.1.1.1

Is allow-from missing a range?

Thanks

mnordhoff commented 3 years ago

Ah, "almost immediately" was about 45 seconds; speedtest-prod.xcr.comcast.net NSEC TTL is 30.

bnkas commented 3 years ago

@mnordhoff If I send a second request within the TTL, I get the cached response (0 msec query time) with NXDOMAIN. I just tried after waiting 300 msec.

@setharnold 176.1.1.1 is the public IP (1.1.1 replaces the actual octets). When the server is public facing, I do indeed add "176.1.1.1" to allow-from. But this one is local only, so I removed the public IP from that directive.

mnordhoff commented 3 years ago

It may be that you're hitting the packet cache while whatever I did (maybe switching from A to AAAA) bypassed it. I don't know.

mnordhoff commented 3 years ago

Affected response:

$ dig +dnssec speedtest.xfinity.com

; <<>> DiG 9.17.19-1+ubuntu20.04.1+isc+1-Ubuntu <<>> +dnssec speedtest.xfinity.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 16411
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 512
;; QUESTION SECTION:
;speedtest.xfinity.com.         IN      A

;; ANSWER SECTION:
speedtest.xfinity.com.  300     IN      CNAME   cdn.speedtest-prod.xcr.comcast.net.
speedtest.xfinity.com.  300     IN      RRSIG   CNAME 5 3 300 20211118151009 20211101150509 56700 xfinity.com. izvPE2F7M82UIbP2TT5fBenRKxo4sHKtKhdyWJu8NHO7TIccAB5LncNb mBBe4iR8/9zc34//mc6JZfMT5efn4scJnAMU99d2lS9qjPOVbZGQxKZB JOELm3X37/t/vOYTdaLPyAFMBJwQnG8YEFpaoP3fujzbxSeS1FMXRgkW +7Y=

;; AUTHORITY SECTION:
speedtest-prod.xcr.comcast.net. 86400 IN SOA    cdn-tr-pan-02.xcr.comcast.net. traffic_ops.speedtest-prod.xcr.comcast.net. 2021110323 28800 7200 604800 30
speedtest-prod.xcr.comcast.net. 30 IN   NSEC    cdn-tr-atl-02.speedtest-prod.xcr.comcast.net. NS SOA RRSIG NSEC DNSKEY
speedtest-prod.xcr.comcast.net. 86400 IN RRSIG  SOA 5 4 86400 20211109015116 20211104005116 38932 speedtest-prod.xcr.comcast.net. EYT41oFgpKUzu34DHVHxcT1nYqHjFRMvLeTOidPYn8A8+RygFNW23tGX sZA0OqvNdRQPTlq+1VUur8v8mI0W3pFnmR6CHU//2/2p6XQVTP7YijMe PvyYetKvRzXv+/+GwY6klnZzj5htvnZQgwLHUnw5Ae19SHhbwn0FIrl4 /cU=
speedtest-prod.xcr.comcast.net. 30 IN   RRSIG   NSEC 5 4 30 20211109015116 20211104005116 38932 speedtest-prod.xcr.comcast.net. YL0myQZ9bSS99AcGSpJWnbm9eXaOQpfHE3oFFyi1glm9GrIMWyTnNXX5 HAJCTi6eHo306QOCdrW6aPSv7hgiqtSyqsKeLZtf8eOHZOGVs4SZP8IU GcgotkF2S9WsUw3PjlUdPKRXtVeTzU4Bxzo7DbHl/tY0LNj8fmwOmeVh Eag=

;; Query time: 1315 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Thu Nov 04 02:03:52 UTC 2021
;; MSG SIZE  rcvd: 778

In particular:

speedtest-prod.xcr.comcast.net. 30 IN   NSEC    cdn-tr-atl-02.speedtest-prod.xcr.comcast.net. NS SOA RRSIG NSEC DNSKEY

IFF cdn sorts before cdn-tr-atl-02 (I don't want to look up DNSSEC canonical ordering), Comcast is denying that cdn exists and Recursor aggressive NSEC is behaving correctly. And it just happens to "work" when the resolver doesn't happen to have the relevant NSEC record in its cache.

With the right queries, I can coax Unbound and a proprietary implementation into returning NXDOMAIN for cdn too.

pieterlexis commented 3 years ago

IFF cdn sorts before cdn-tr-atl-02

It does :)
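The ordering claim can be checked with a rough sketch of DNSSEC canonical name ordering (RFC 4034, section 6.1): names are compared label by label from the rightmost label, as lowercased byte strings, and a name that runs out of labels first sorts first. The function names below are made up for illustration, and the sketch assumes plain ASCII names without escaped dots or \DDD sequences. It also does not check that the queried name belongs to the zone.

```python
from typing import Tuple


def canonical_key(name: str) -> Tuple[bytes, ...]:
    """Sort key approximating RFC 4034 canonical ordering.

    Labels are lowercased and compared right to left; a shorter
    tuple that is a prefix of a longer one sorts first.
    """
    labels = [label for label in name.rstrip(".").lower().split(".") if label]
    return tuple(label.encode("ascii") for label in reversed(labels))


def nsec_covers(owner: str, next_name: str, qname: str) -> bool:
    """Return True if qname falls strictly between owner and next_name."""
    o, n, q = (canonical_key(x) for x in (owner, next_name, qname))
    if o < n:
        return o < q < n
    # The last NSEC in a zone points back to the apex (wrap-around).
    return q > o or q < n


if __name__ == "__main__":
    owner = "speedtest-prod.xcr.comcast.net."
    next_name = "cdn-tr-atl-02.speedtest-prod.xcr.comcast.net."
    for qname in ("cdn.speedtest-prod.xcr.comcast.net.",
                  "*.speedtest-prod.xcr.comcast.net."):
        print(qname, nsec_covers(owner, next_name, qname))
```

Under these assumptions, both cdn and the wildcard fall inside the interval described by the NSEC record quoted above, because "cdn" is a prefix of "cdn-tr-atl-02" and "*" sorts before "c".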

rgacogne commented 3 years ago

Yep, aggressive NSEC is indeed picking up the denial of existence proof and synthesizing an answer from that:

Nov 04 09:54:18 1 [2/1] question for 'cdn.speedtest-prod.xcr.comcast.net|A' from 127.0.0.1:37908
Nov 04 09:54:18 [2] : no TA found for 'cdn.speedtest-prod.xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'speedtest-prod.xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'net' among 1
Nov 04 09:54:18 [2] : got TA for '.'
Nov 04 09:54:18 [2] QM cdn.speedtest-prod.xcr.comcast.net.|A child=(empty): doResolve
Nov 04 09:54:18 [2] cdn.speedtest-prod.xcr.comcast.net: Wants DNSSEC processing, auth data in query for A
Nov 04 09:54:18 [2] cdn.speedtest-prod.xcr.comcast.net: Recursion not requested for 'cdn.speedtest-prod.xcr.comcast.net|A', peeking at auth/forward zones
Nov 04 09:54:18 Looking for a NSEC before cdn.speedtest-prod.xcr.comcast.net: found a possible NSEC at speedtest-prod.xcr.comcast.net cdn.speedtest-prod.xcr.comcast.net is covered by (speedtest-prod.xcr.comcast.net to cdn-tr-atl-02.speedtest-prod.xcr.comcast.net)  and it proves that the name does not exist
Nov 04 09:54:18 Now looking for a NSEC before the wildcard *.speedtest-prod.xcr.comcast.net: found a possible NSEC at speedtest-prod.xcr.comcast.net *.speedtest-prod.xcr.comcast.net is covered by (speedtest-prod.xcr.comcast.net to cdn-tr-atl-02.speedtest-prod.xcr.comcast.net)  and it proves that there is no matching wildcard
Nov 04 09:54:18 Found valid NSECs covering the requested name and type!
Nov 04 09:54:18 [2] QM cdn.speedtest-prod.xcr.comcast.net.|A child=(empty): Step0 Found in cache
Nov 04 09:54:18 Answer to cdn.speedtest-prod.xcr.comcast.net|A for 127.0.0.1:37908 validates correctly
Nov 04 09:54:18 1 [2/1] answer to question 'cdn.speedtest-prod.xcr.comcast.net|A': 0 answers, 1 additional, took 0 packets, 0 netw ms, 0 tot ms, 0 throttled, 0 timeouts, 0/0 tcp/dot connections, rcode=3, dnssec=Secure

So the speedtest-prod.xcr.comcast.net. zone has broken DNSSEC and needs to be fixed. In the meantime a negative trust anchor can be used to work around the issue: https://docs.powerdns.com/recursor/lua-config/dnssec.html#addNTA

phonedph1 commented 3 years ago

I've passed this onto a contact there.

bnkas commented 3 years ago

@phonedph1 Thank you for that.

@rgacogne Thank you for the insight and reference to negative trust anchor. And thanks to everyone else who helped identify the root cause.

My concern is that this type of issue could also be happening for other domains as well. I have seen folks here say that "aggressive" NSEC is generating the NXDOMAIN response. When digging using other recursive resolvers with dnssec enabled, they are able to successfully resolve it to an IP address.

Could there be a way (a configuration option that can be added in the future) to control how aggressive NSEC behaves?

Maybe the default can be "aggressive" but other options can make it more lax or sort differently (similar to other recursors) so that it would resolve to an IP address in the case above.

Thanks.

rgacogne commented 3 years ago

Aggressive NSEC caching can be disabled by setting aggressive-nsec-cache-size to 0. It won't solve the fact that something in that zone is broken, though, and thus it could break in other ways as well.

bnkas commented 3 years ago

@rgacogne Thank you for that explanation. I didn't fully understand that setting previously.

Adding aggressive-nsec-cache-size=0 to the recursor config file does indeed address the NXDOMAIN concern I had, as the recursor now reliably responds with IPs every time I dig for speedtest.xfinity.com.

From my perspective, I consider this issue closed as I now know of a way to control aggressive NSEC.

Thanks

omoerbeek commented 3 years ago

Do note that aggressive NSEC caching is an effective mechanism to reduce the number of queries to authoritative servers; there is a reason it is switched on by default.

mnordhoff commented 3 years ago

And there aren't that many broken zones that cause problems.

paddg commented 3 years ago

I suspect that domains hosted and signed on F5 are mainly affected. This is a long-known problem. https://en.blog.nic.cz/2019/07/10/error-in-dnssec-implementation-on-f5-big-ip-load-balancers/