PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Inconsistent EDE data in dnstap #14248

Open johnhtodd opened 3 months ago

johnhtodd commented 3 months ago

Short description

Inconsistent messages in dnstap for EDE versus what is provided in query response

Environment

I've been looking at DNSSEC errors (coincidentally, in Amsterdam) for a day or so, trying to figure out which classes of errors handed back in EDE result in a SERVFAIL towards the end user. I've trimmed down the error set: I excluded "No reachable authority" errors (which are rampant).

Here is the set from 24 hours excluding "no reachable authority", from a small sub-section of our AMS cluster.

┌─event.responseData.opt.ede.purpose─┬─errortype─┐
│ ['Network Error']                  │        11 │
│ ['Unsupported DNSKEY Algorithm']   │      1520 │
│ ['Signature Expired']              │     37544 │
│ ['DNSKEY Missing']                 │     49717 │
│ ['RRSIGs Missing']                 │     56164 │
│ ['NSEC Missing']                   │     60216 │
│ ['Synthesized']                    │     75301 │
│ ['DNSSEC Bogus']                   │     78056 │
│ ['Other Error']                    │    277917 │
└────────────────────────────────────┴───────────┘

So what are all those "other error" items? This seems to be an unusually large number in the "catchall" category.
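To put the "other error" count in proportion, here is a minimal sketch that computes each category's share of the total. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Counts copied from the 24-hour table above ("no reachable authority" excluded).
counts = {
    "Network Error": 11,
    "Unsupported DNSKEY Algorithm": 1520,
    "Signature Expired": 37544,
    "DNSKEY Missing": 49717,
    "RRSIGs Missing": 56164,
    "NSEC Missing": 60216,
    "Synthesized": 75301,
    "DNSSEC Bogus": 78056,
    "Other Error": 277917,
}

total = sum(counts.values())

# Print the categories from largest to smallest with their share of the total.
for purpose, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{purpose:32s} {n:8d} {n / total:6.1%}")
```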

I dug into this a bit, and I need some sanity checking, or perhaps this is a bug.

I found a domain that is coming up with "other error" as reported in the dnstap data set - tracker.publicbt.com. There are ~6000 of those in one of my logfiles, so I figured it would be a good test.

When I look at dnsviz, this is a "refused" error, and sure enough when I do a "dig" I get a no reachable authority result:

jtodd@dev01:~$ dig @9.9.9.9 tracker.publicbt.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> @9.9.9.9 tracker.publicbt.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14781
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 22 (No Reachable Authority): (delegation publicbt.com)
;; QUESTION SECTION:
;tracker.publicbt.com.      IN  A

;; Query time: 0 msec
;; SERVER: 9.9.9.9#53(9.9.9.9) (UDP)
;; WHEN: Sun May 26 16:48:38 UTC 2024
;; MSG SIZE  rcvd: 78

jtodd@dev01:~$

But when I look through the dnstap logs, I find that these responses are not listed as "no reachable authority" but instead show up as "other error" (info code 0). I find no events in the dnstap output that show "no reachable authority" for that name, even though the name appears hundreds of times. All of the errors are "other error", which does not match what I see in my actual query results.

I am collecting the data from dnstap, which is sent by dnsdist. pdns-rec is of course behind dnsdist, along with (as usual) unbound, which we currently do not have sending EDE results (therefore, unbound answers never appear with any EDE data set, so they are not considered in my searches).

Is this a dnsdist error with dnstap? Or is this a method problem?

omoerbeek commented 3 months ago

AFAIK dnsdist just passes the packet received from the resolver (including the embedded EDE, if available) to the dnstap stream. The dnstap message itself has no EDE field, i.e. dnsdist does not do any processing wrt EDE. So it would be of interest to see the actual answer received by dnsdist and the corresponding dnstap message produced.

It could be that the EDE 0 (Other) codes are already in the answer sent by the resolver. It would also be interesting to see if there's any extra text associated with the EDE 0 code.
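Since the EDE lives inside the DNS packet rather than in a dnstap field, one way to cross-check a dnstap consumer is to decode the EDE option directly from the raw response bytes. The following is a minimal sketch, assuming the wire-format DNS response has already been pulled out of the dnstap frame (that step is not shown). The function names `skip_name` and `extract_ede` are hypothetical helpers, not part of dnsdist or any library. It uses EDNS option code 15 and the info-code/extra-text layout from RFC 8914.

```python
import struct

EDE_OPTION_CODE = 15  # EDNS option code for Extended DNS Errors (RFC 8914)
OPT_RR_TYPE = 41      # RR type of the EDNS OPT pseudo-record


def skip_name(msg, offset):
    """Return the offset just past the domain name that starts at offset."""
    while True:
        length = msg[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:
            # Compression pointer: two bytes, and it terminates the name.
            return offset + 2
        offset += 1 + length


def extract_ede(msg):
    """Return a list of (info_code, extra_text) found in a raw DNS message."""
    qdcount, ancount, nscount, arcount = struct.unpack_from("!4H", msg, 4)
    offset = 12  # fixed DNS header size
    for _ in range(qdcount):
        offset = skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    results = []
    for _ in range(ancount + nscount + arcount):
        offset = skip_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack_from("!HHIH", msg, offset)
        offset += 10
        rdata_end = offset + rdlength
        if rtype == OPT_RR_TYPE:
            # OPT rdata is a sequence of (option code, option length, data).
            pos = offset
            while pos + 4 <= rdata_end:
                code, optlen = struct.unpack_from("!HH", msg, pos)
                pos += 4
                if code == EDE_OPTION_CODE and optlen >= 2:
                    (info_code,) = struct.unpack_from("!H", msg, pos)
                    text = msg[pos + 2:pos + optlen].decode("utf-8", "replace")
                    results.append((info_code, text))
                pos += optlen
        offset = rdata_end
    return results
```

Running `extract_ede` over the same response bytes that the dnstap consumer saw would show whether info code 0 is really on the wire or only appears after parsing. The sketch does no bounds checking, so malformed packets will raise an exception rather than be handled gracefully.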

johnhtodd commented 3 months ago

I am 100% sure that the queries I am generating/receiving are coming back with "; EDE: 22 (No Reachable Authority): (delegation publicbt.com)" as the result (see the "dig" results). Those exact queries are creating results that show up in Vector (my dnstap parser) with "info code: 0". Now that I think about it, this may be a bug in Vector: it is suspicious that info code "0" is the result in these events, since zero is an easy number to reach in a bug condition. Let me pursue that path for a bit. I have nothing other than Vector that can easily unpack a dnstap message with EDE, so it may take a while.