bnkas closed this issue 3 years ago
I don't have to wait 600 seconds, it resolves if I send a second query almost immediately.
Hello, I'm curious about this..
allow-from=127.0.0.0/8, 10.0.0.0/8
local-address=127.0.0.1, 176.1.1.1
query-local-address=176.1.1.1
Is allow-from missing a range?
Thanks
Ah, "almost immediately" was about 45 seconds; the speedtest-prod.xcr.comcast.net NSEC TTL is 30.
@mnordhoff If I send a second request within the TTL, I get the cached response (0 msec query time) with NXDOMAIN. I just tried after waiting 300 msec.
@setharnold 176.1.1.1 is the public IP (1.1.1 replaces the actual octets). When the server is public facing, I do indeed add the "176.1.1.1" in the allow from. But this one is local only, so I removed the public IP from that directive.
It may be that you're hitting the packet cache while whatever I did (maybe switching from A to AAAA) bypassed it. I don't know.
Affected response:
$ dig +dnssec speedtest.xfinity.com
; <<>> DiG 9.17.19-1+ubuntu20.04.1+isc+1-Ubuntu <<>> +dnssec speedtest.xfinity.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 16411
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 512
;; QUESTION SECTION:
;speedtest.xfinity.com. IN A
;; ANSWER SECTION:
speedtest.xfinity.com. 300 IN CNAME cdn.speedtest-prod.xcr.comcast.net.
speedtest.xfinity.com. 300 IN RRSIG CNAME 5 3 300 20211118151009 20211101150509 56700 xfinity.com. izvPE2F7M82UIbP2TT5fBenRKxo4sHKtKhdyWJu8NHO7TIccAB5LncNb mBBe4iR8/9zc34//mc6JZfMT5efn4scJnAMU99d2lS9qjPOVbZGQxKZB JOELm3X37/t/vOYTdaLPyAFMBJwQnG8YEFpaoP3fujzbxSeS1FMXRgkW +7Y=
;; AUTHORITY SECTION:
speedtest-prod.xcr.comcast.net. 86400 IN SOA cdn-tr-pan-02.xcr.comcast.net. traffic_ops.speedtest-prod.xcr.comcast.net. 2021110323 28800 7200 604800 30
speedtest-prod.xcr.comcast.net. 30 IN NSEC cdn-tr-atl-02.speedtest-prod.xcr.comcast.net. NS SOA RRSIG NSEC DNSKEY
speedtest-prod.xcr.comcast.net. 86400 IN RRSIG SOA 5 4 86400 20211109015116 20211104005116 38932 speedtest-prod.xcr.comcast.net. EYT41oFgpKUzu34DHVHxcT1nYqHjFRMvLeTOidPYn8A8+RygFNW23tGX sZA0OqvNdRQPTlq+1VUur8v8mI0W3pFnmR6CHU//2/2p6XQVTP7YijMe PvyYetKvRzXv+/+GwY6klnZzj5htvnZQgwLHUnw5Ae19SHhbwn0FIrl4 /cU=
speedtest-prod.xcr.comcast.net. 30 IN RRSIG NSEC 5 4 30 20211109015116 20211104005116 38932 speedtest-prod.xcr.comcast.net. YL0myQZ9bSS99AcGSpJWnbm9eXaOQpfHE3oFFyi1glm9GrIMWyTnNXX5 HAJCTi6eHo306QOCdrW6aPSv7hgiqtSyqsKeLZtf8eOHZOGVs4SZP8IU GcgotkF2S9WsUw3PjlUdPKRXtVeTzU4Bxzo7DbHl/tY0LNj8fmwOmeVh Eag=
;; Query time: 1315 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Thu Nov 04 02:03:52 UTC 2021
;; MSG SIZE rcvd: 778
In particular:
speedtest-prod.xcr.comcast.net. 30 IN NSEC cdn-tr-atl-02.speedtest-prod.xcr.comcast.net. NS SOA RRSIG NSEC DNSKEY
IFF cdn sorts before cdn-tr-atl-02 (I don't want to look up DNSSEC canonical ordering), Comcast is denying that cdn exists and Recursor aggressive NSEC is behaving correctly. And it just happens to "work" when the resolver doesn't happen to have the relevant NSEC record in its cache.
With the right queries, I can coax Unbound and a proprietary implementation into returning NXDOMAIN for cdn too.
IFF cdn sorts before cdn-tr-atl-02
It does :)
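For anyone who wants to check the ordering themselves, here is a minimal Python sketch of the canonical name ordering from RFC 4034 section 6.1: labels are compared right to left as lowercase byte strings, and a shorter name sorts before its longer extensions. The helper name canonical_key is made up for this illustration, and the sketch assumes plain ASCII names without escaped characters.

```python
def canonical_key(name):
    """Sort key approximating RFC 4034 section 6.1 canonical ordering.

    Assumption: ASCII names with no escaped characters in labels.
    """
    labels = [label for label in name.rstrip(".").split(".") if label]
    # Compare the most significant (rightmost) label first, case-insensitively.
    return [label.lower().encode("ascii") for label in reversed(labels)]


names = [
    "cdn-tr-atl-02.speedtest-prod.xcr.comcast.net.",
    "cdn.speedtest-prod.xcr.comcast.net.",
    "speedtest-prod.xcr.comcast.net.",
    "*.speedtest-prod.xcr.comcast.net.",
]

ordered = sorted(names, key=canonical_key)
for name in ordered:
    print(name)
```

In the sorted output both cdn and the wildcard land between the zone apex and cdn-tr-atl-02, which is the range the NSEC record above claims is empty.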
Yep, aggressive NSEC is indeed picking up the denial of existence proof and synthesizing an answer from that:
Nov 04 09:54:18 1 [2/1] question for 'cdn.speedtest-prod.xcr.comcast.net|A' from 127.0.0.1:37908
Nov 04 09:54:18 [2] : no TA found for 'cdn.speedtest-prod.xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'speedtest-prod.xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'xcr.comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'comcast.net' among 1
Nov 04 09:54:18 [2] : no TA found for 'net' among 1
Nov 04 09:54:18 [2] : got TA for '.'
Nov 04 09:54:18 [2] QM cdn.speedtest-prod.xcr.comcast.net.|A child=(empty): doResolve
Nov 04 09:54:18 [2] cdn.speedtest-prod.xcr.comcast.net: Wants DNSSEC processing, auth data in query for A
Nov 04 09:54:18 [2] cdn.speedtest-prod.xcr.comcast.net: Recursion not requested for 'cdn.speedtest-prod.xcr.comcast.net|A', peeking at auth/forward zones
Nov 04 09:54:18 Looking for a NSEC before cdn.speedtest-prod.xcr.comcast.net: found a possible NSEC at speedtest-prod.xcr.comcast.net cdn.speedtest-prod.xcr.comcast.net is covered by (speedtest-prod.xcr.comcast.net to cdn-tr-atl-02.speedtest-prod.xcr.comcast.net) and it proves that the name does not exist
Nov 04 09:54:18 Now looking for a NSEC before the wildcard *.speedtest-prod.xcr.comcast.net: found a possible NSEC at speedtest-prod.xcr.comcast.net *.speedtest-prod.xcr.comcast.net is covered by (speedtest-prod.xcr.comcast.net to cdn-tr-atl-02.speedtest-prod.xcr.comcast.net) and it proves that there is no matching wildcard
Nov 04 09:54:18 Found valid NSECs covering the requested name and type!
Nov 04 09:54:18 [2] QM cdn.speedtest-prod.xcr.comcast.net.|A child=(empty): Step0 Found in cache
Nov 04 09:54:18 Answer to cdn.speedtest-prod.xcr.comcast.net|A for 127.0.0.1:37908 validates correctly
Nov 04 09:54:18 1 [2/1] answer to question 'cdn.speedtest-prod.xcr.comcast.net|A': 0 answers, 1 additional, took 0 packets, 0 netw ms, 0 tot ms, 0 throttled, 0 timeouts, 0/0 tcp/dot connections, rcode=3, dnssec=Secure
So the speedtest-prod.xcr.comcast.net. zone has broken DNSSEC and needs to be fixed. In the meantime a negative trust anchor can be used to work around the issue: https://docs.powerdns.com/recursor/lua-config/dnssec.html#addNTA
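The two checks in the trace (the name is covered by an NSEC, and so is the wildcard) can be sketched roughly as below. This is only an illustration and not the Recursor's actual code: the function names and the cache layout are made up, the wildcard is assumed to sit directly under the zone apex, the wrap-around NSEC at the end of a zone is ignored, and no signature validation is done.

```python
def canonical_key(name):
    # RFC 4034 section 6.1 ordering: rightmost label first, lowercase bytes.
    labels = [label for label in name.rstrip(".").split(".") if label]
    return [label.lower().encode("ascii") for label in reversed(labels)]


def covers(owner, next_name, qname):
    # True if qname falls strictly between the NSEC owner and its "next" name.
    return canonical_key(owner) < canonical_key(qname) < canonical_key(next_name)


def can_synthesize_nxdomain(qname, zone, nsec_cache, now):
    """nsec_cache is a hypothetical list of (owner, next_name, expires_at)."""
    live = [(owner, nxt) for owner, nxt, expires_at in nsec_cache if expires_at > now]
    wildcard = "*." + zone  # simplification: closest encloser is the apex
    name_denied = any(covers(owner, nxt, qname) for owner, nxt in live)
    wildcard_denied = any(covers(owner, nxt, wildcard) for owner, nxt in live)
    return name_denied and wildcard_denied


ZONE = "speedtest-prod.xcr.comcast.net."
# The NSEC from the dig output, cached at t=0 with its TTL of 30 seconds.
cache = [(ZONE, "cdn-tr-atl-02." + ZONE, 30)]

print(can_synthesize_nxdomain("cdn." + ZONE, ZONE, cache, now=10))
print(can_synthesize_nxdomain("cdn." + ZONE, ZONE, cache, now=45))
```

With the NSEC still in the cache the sketch reports that NXDOMAIN can be synthesized; once the 30 second TTL has passed it cannot, which is consistent with the report that a second query about 45 seconds later resolved.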
I've passed this on to a contact there.
@phonedph1 Thank you for that.
@rgacogne Thank you for the insight and reference to negative trust anchor. And thanks to everyone else who helped identify the root cause.
My concern is that this type of issue could be happening for other domains as well. I have seen folks here say that "aggressive" NSEC is generating the NXDOMAIN response. When digging using other recursive resolvers with DNSSEC enabled, they are able to successfully resolve it to an IP address.
Could there be a way (a configuration option that can be added in the future) to control how aggressive NSEC behaves?
Maybe the default can be "aggressive" but other options can make it more lax or sort differently (similar to other recursors) so that it would resolve to an IP address in the case above.
Thanks.
Aggressive NSEC caching can be disabled by setting aggressive-nsec-cache-size to 0. It won't solve the fact that something in that zone is broken, though, and thus it could break in other ways as well.
@rgacogne Thank you for that explanation. I didn't fully understand that setting previously.
Adding aggressive-nsec-cache-size=0 to the recursor config file does indeed address the NXDOMAIN concern I had, as the recursor now reliably responds with IPs every time I dig for speedtest.xfinity.com.
From my perspective, I consider this issue closed as I now know of a way to control aggressive NSEC.
Thanks
Do note that aggressive NSEC caching is an effective mechanism to reduce the number of queries to authoritative servers; there is a reason it is switched on by default.
And there aren't that many broken zones that cause problems.
I suspect that domains hosted and signed on F5 are mainly affected. This is a long-known problem: https://en.blog.nic.cz/2019/07/10/error-in-dnssec-implementation-on-f5-big-ip-load-balancers/
Short description
When PowerDNS recursor is starting fresh, speedtest.xfinity.com always results in an NXDOMAIN response after a long query time (often over 1500 msec). This is the only subdomain I can consistently replicate the behavior on.
Environment
Steps to reproduce
systemctl restart pdns-recursor.service
rec_control trace-regex '.*\.xfinity\.com\.$'
dig speedtest.xfinity.com
Here is the only config file (there are no scripts or such)
Expected behaviour
Resolving the subdomain to IPs
Actual behaviour
NXDOMAIN response is received
I also attached the trace logs from the recursor as a text file: trace-log.txt
Other information
The TTL is 10 minutes on the NXDOMAIN. If I wait out the 10 minutes and run the exact same dig again, speedtest.xfinity.com resolves successfully to its IPs (the dig output for that is shown in the Expected behaviour section). That's why in the repro steps I mention that the cache should be cleared first. The presence of previously cached queries causes it to resolve correctly the second time, but the first query always gets an NXDOMAIN response. I also replicated the issue on three different PowerDNS recursors on completely different networks/countries.