PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

dnsdist rcode statistics extension and renaming #6255

Open johnhtodd opened 6 years ago

johnhtodd commented 6 years ago

Program: dnsdist

Issue type: Feature request

Short description

For dnsdist, we believe that not all of the replies actually sent to clients are being counted. A quick examination of the code by zeha appears to support this: in particular, “servfail” replies generated from the packetcache do not seem to be recorded in any obvious counter. The current list of dnsdist statistics (https://dnsdist.org/statistics.html#responses) may be incomplete, or at least it is not obvious how to use it to determine the true rate at which each rcode is returned to clients. Additionally, we require a more detailed breakdown of which reply rcodes are actually being sent to clients, and where within dnsdist’s processing those replies originate. This would allow us to understand the behavior of dnsdist at large scale and across differing client and network conditions.

Description

Here is the list of current (1.2) items that seem to be relevant when counting replies or client-side data:

- cache-hits: Number of times an answer was retrieved from cache.
- cache-misses: Number of times an answer was not found in the cache.
- responses: Number of responses received from backends.
- rule-nxdomain: Number of NXDomain answers returned because of a rule.
- rule-refused: Number of Refused answers returned because of a rule.
- self-answered: Number of self-answered responses.
- servfail-responses: Number of servfail answers received from backends.

On IRC we discussed possibly breaking the various rcode types out into more specific statistics counters, tracking replies by rcode and by where the response originated, in order to better understand dnsdist performance.

New naming and rcode statistics proposal:

- replies-<rcode>: Number of responses sent to clients of this rcode type (total)
- replies-packetcache-<rcode>: Number of responses sent to clients of this rcode type that were from packetcache
- replies-rule-<rcode>: Number of responses sent to clients from dnsdist internal rules
- backend-responses-<rcode>: Number of responses received from backend servers of this rcode type

The \<rcode> value would be taken from the IANA list and would be the “human-readable” name in lower case such as “noerror” or “refused”. (https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml) Currently, values in the range of 0-15 are interesting, but having the extended rcodes would also be useful as they become adopted. Replies with rcodes not matching the existing IANA-recognized list could be ignored, or put into a single bucket, or their numeric values could be used as the rcode summary (this is left unspecified - someone else with other requirements may need this, but not us.)
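A minimal sketch of how such counter names could be derived, assuming the IANA mnemonics for rcodes 0-10 and picking the numeric-value fallback for anything else (one of the options left open above). `RCODE_NAMES`, `PREFIXES` and `counter_name` are hypothetical names used for illustration only; they are not part of dnsdist.

```python
# Hypothetical illustration of the proposed naming scheme; not dnsdist code.

# IANA mnemonics for rcodes 0-10, lower-cased as proposed above.
RCODE_NAMES = {
    0: "noerror", 1: "formerr", 2: "servfail", 3: "nxdomain", 4: "notimp",
    5: "refused", 6: "yxdomain", 7: "yxrrset", 8: "nxrrset", 9: "notauth",
    10: "notzone",
}

# Counter-name prefix for each place a response can be counted.
PREFIXES = {
    "total": "replies",
    "packetcache": "replies-packetcache",
    "rule": "replies-rule",
    "backend": "backend-responses",
}


def counter_name(origin: str, rcode: int) -> str:
    """Build a counter name such as 'replies-packetcache-servfail'."""
    # Assumption: rcodes without a known mnemonic use their numeric value.
    label = RCODE_NAMES.get(rcode, str(rcode))
    return f"{PREFIXES[origin]}-{label}"
```

For example, `counter_name("packetcache", 2)` would give `replies-packetcache-servfail`, the counter that appears to be missing today.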

It may be the case that even further breakdown of responses from backends might be useful, for example creating “backend-responses-\<backend-name>-\<rcode>” so that each backend system could be evaluated separately, and measured for performance. This is a non-trivial expansion of the original concept of simply collecting rcode statistics to clients, and perhaps may be a separate ticket.
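As a rough sketch of the per-backend idea, the aggregate counter could be kept alongside the per-backend one so that both views stay consistent. The function name and the backend names below are hypothetical, and this only illustrates the naming; it says nothing about how dnsdist would store counters internally.

```python
from collections import Counter


def record_backend_response(stats: Counter, backend: str, rcode_name: str) -> None:
    """Count one backend response both per backend and in aggregate."""
    # Per-backend counter, e.g. 'backend-responses-ns1-servfail'.
    stats[f"backend-responses-{backend}-{rcode_name}"] += 1
    # Aggregate counter across all backends, e.g. 'backend-responses-servfail'.
    stats[f"backend-responses-{rcode_name}"] += 1


stats = Counter()
record_backend_response(stats, "ns1", "servfail")
record_backend_response(stats, "ns2", "servfail")
```

One caveat with this naming: a backend name that itself contains a hyphen would make the counter name ambiguous to parse, so a different separator or a structured (labelled) export might be preferable.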

By creating these new buckets, several existing statistics would probably be renamed so that they would match the new structure:

- rule-refused -> replies-rule-refused
- rule-nxdomain -> replies-rule-nxdomain
- servfail-responses -> backend-responses-servfail
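If the renames went ahead, existing dashboards and scripts would need a translation step. A minimal sketch, assuming statistics are available as a plain name-to-value mapping (the `RENAMES` table and `translate` helper are hypothetical, not something dnsdist provides):

```python
# Hypothetical helper for consumers of the statistics; not dnsdist code.
RENAMES = {
    "rule-refused": "replies-rule-refused",
    "rule-nxdomain": "replies-rule-nxdomain",
    "servfail-responses": "backend-responses-servfail",
}


def translate(stats: dict) -> dict:
    """Return a copy of `stats` with old counter names mapped to new ones."""
    # Names without an entry in RENAMES are passed through unchanged.
    return {RENAMES.get(name, name): value for name, value in stats.items()}
```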

Additionally, it would seem to follow that these statements would be true (please correct me if this is not the case!)

zeha commented 6 years ago

Also the docs should document all of that.

johnhtodd commented 6 years ago

More thinking on this: should the counters towards clients be expanded and named so that the bound address/port/protocol is included in the results? Examples: "replies-0.0.0.0-53-UDP-noerror" or "replies-packetcache-2001:db8::a3-53-TCP-noerror". Most sites have at least v4 and v6, and being able to distinguish between them would seem useful. Our site has many IP addresses associated with a single dnsdist instance, and getting good response statistics on a per address/port/protocol basis seems like a reasonable extension to this idea if there is going to be an overhaul of the code.
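A small sketch of how names in this shape could be built and taken apart again. Both helpers are hypothetical and only illustrate the naming idea; they assume the rcode label, protocol and port never contain a hyphen, which is why the name is split from the right.

```python
def bind_counter_name(prefix: str, address: str, port: int,
                      protocol: str, rcode_name: str) -> str:
    """Build e.g. 'replies-0.0.0.0-53-UDP-noerror'."""
    return f"{prefix}-{address}-{port}-{protocol.upper()}-{rcode_name}"


def split_bind_counter(name: str, prefix: str):
    """Recover (address, port, protocol, rcode_name) from a counter name."""
    if not name.startswith(prefix + "-"):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    rest = name[len(prefix) + 1:]
    # Split from the right: the last three fields are port, protocol, rcode;
    # whatever remains on the left is the address (IPv6 colons are harmless).
    address, port, protocol, rcode_name = rest.rsplit("-", 3)
    return address, int(port), protocol, rcode_name
```

Splitting from the right keeps the parser independent of the address format, at the cost of requiring the caller to know which prefix it is looking at.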

zeha commented 6 years ago

Queries(/responses) that are dropped because of a ResponseAction are partially accounted:

zeha commented 6 years ago

Queries that hit a pool with no (up) servers will increase "no-policy", but there is no separate statistics bucket for muted UDP clients.

zeha commented 6 years ago

Some notes (for myself mostly):

zeha commented 6 years ago

pushed some initial code to https://github.com/zeha/pdns/tree/dnsdist-stats

johnhtodd commented 6 years ago

zeha - Who else do you think would be interested in this? I'd like to get some more input on whether this is a useful fix. Does rgacogne's recent patch in https://github.com/PowerDNS/pdns/pull/6563 add a previously unmonitored dimension to this (drops)?

zeha commented 6 years ago

Unsure who else would be interested (at least publicly).

drops in #6563 are per downstream server; if we do backend-responses-<server>..., then we should look at that too.

rgacogne commented 6 years ago

Note that the drop rate is the number of reused per second over the last period, so it can easily be derived from reused even before #6563, if needed.
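A rough sketch of that derivation, assuming the per-server reused value is a monotonically increasing counter that is sampled twice, a known number of seconds apart (the function name and sampling approach are illustrative assumptions, not dnsdist code):

```python
def drop_rate(reused_before: int, reused_after: int, seconds: float) -> float:
    """Estimate drops per second from two samples of the reused counter."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    # Assumption: the counter only ever increases between the two samples.
    return (reused_after - reused_before) / seconds
```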