dnsdist rcode statistics extension and renaming

johnhtodd commented 6 years ago

Program: dnsdist

Issue type: Feature request

Short description

For dnsdist, we believe that not all of the replies we send back to clients are being counted. A quick examination of the code by zeha appears to support this: in particular, “servfail” replies generated from the packetcache do not seem to be recorded in any obvious counter. The current list of dnsdist statistics (https://dnsdist.org/statistics.html#responses) may be incomplete, or at least it is not obvious how to use it to determine the true response rates of the various rcodes sent to clients. Additionally, we need a more detailed breakdown of which reply rcodes are actually being sent to clients, and where within dnsdist those replies originate. This would allow us to understand the behavior of dnsdist at large scale and across differing client and network conditions.

Description

Here is the list of current (1.2) items that seem to be relevant when counting replies or client-side data:

cache-hits: Number of times an answer was retrieved from cache.
cache-misses: Number of times an answer was not found in the cache.
responses: Number of responses received from backends.
rule-nxdomain: Number of NXDomain answers returned because of a rule.
rule-refused: Number of Refused answers returned because of a rule.
self-answered: Number of self-answered responses.
servfail-responses: Number of servfail answers received from backends.

On IRC we discussed possibly splitting the statistics into more specific counters that track replies by rcode type and by where the response originated, for the purpose of better understanding dnsdist performance.

New naming and rcode statistics proposal:

replies-<rcode>: Number of responses sent to clients of this rcode type (total).
replies-packetcache-<rcode>: Number of responses sent to clients of this rcode type that were from packetcache.
replies-rule-<rcode>: Number of responses sent to clients from dnsdist internal rules.
backend-responses-<rcode>: Number of responses received from backend servers of this rcode type.
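To make the intended bookkeeping concrete, here is a minimal Python sketch of how the proposed counters might be incremented. It is only an illustration of the naming scheme, not dnsdist code: `record_reply`, `record_backend_response` and the `origin` values are hypothetical names introduced for this example.

```python
from collections import Counter

# Hypothetical reply origins, mirroring the name prefixes in the proposal.
ORIGIN_PREFIX = {
    "packetcache": "replies-packetcache-",
    "rule": "replies-rule-",
}

def record_backend_response(stats, rcode_name):
    """Count one response received from a backend server."""
    stats["backend-responses-" + rcode_name] += 1

def record_reply(stats, rcode_name, origin=None):
    """Count one reply sent to a client.

    origin is "packetcache", "rule", or None for an answer relayed
    from a backend (assumption: relayed answers only hit the total).
    """
    stats["replies-" + rcode_name] += 1
    if origin is not None:
        stats[ORIGIN_PREFIX[origin] + rcode_name] += 1

stats = Counter()
record_backend_response(stats, "servfail")      # backend answers servfail
record_reply(stats, "servfail")                 # ...which is relayed to the client
record_reply(stats, "servfail", "packetcache")  # later served from packetcache
record_reply(stats, "refused", "rule")          # refused by a rule
```

With this layout, a servfail served from packetcache shows up in both replies-servfail and replies-packetcache-servfail, which is exactly the case that seems to be uncounted today.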

The <rcode> value would be taken from the IANA list (https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml) and would be the “human-readable” name in lower case, such as “noerror” or “refused”. Currently, values in the range 0-15 are interesting, but having the extended rcodes would also be useful as they become adopted. Replies with rcodes that do not match the IANA-recognized list could be ignored, put into a single bucket, or reported under their numeric value. This is left unspecified: someone with other requirements may need it, but we do not.
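As a sketch of the label derivation, the following Python snippet maps a numeric rcode to its lower-case IANA name and shows the three possible treatments of unrecognized values. The table covers only rcodes 0-10; the function name `rcode_label`, the `unknown` parameter and the bucket name "other" are assumptions made for this example.

```python
# Lower-cased IANA names for rcodes 0-10 (RFC 1035 and RFC 2136).
RCODE_NAMES = {
    0: "noerror",
    1: "formerr",
    2: "servfail",
    3: "nxdomain",
    4: "notimp",
    5: "refused",
    6: "yxdomain",
    7: "yxrrset",
    8: "nxrrset",
    9: "notauth",
    10: "notzone",
}

def rcode_label(rcode, unknown="numeric"):
    """Return the <rcode> part of a counter name, or None to skip counting.

    unknown selects what happens to rcodes missing from the table:
    "ignore" (not counted), "bucket" (one shared counter, hypothetically
    named "other"), or "numeric" (the number itself becomes the label).
    """
    name = RCODE_NAMES.get(rcode)
    if name is not None:
        return name
    if unknown == "ignore":
        return None
    if unknown == "bucket":
        return "other"
    return str(rcode)

label = rcode_label(2)   # "servfail", so the counter is replies-servfail
```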

An even further breakdown of responses from backends might be useful, for example creating “backend-responses-<backend-name>-<rcode>” so that each backend system could be evaluated separately and measured for performance. This is a non-trivial expansion of the original concept of simply collecting rcode statistics for replies to clients, and may belong in a separate ticket.

By creating these new buckets, several existing statistics would probably be renamed so that they would match the new structure:

rule-refused -> replies-rule-refused
rule-nxdomain -> replies-rule-nxdomain
servfail-responses -> backend-responses-servfail
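For anyone consuming these counters in scripts or dashboards, the rename could be bridged with a small mapping. This is a minimal Python sketch, assuming the statistics have already been read into a dict (how they are fetched is not shown); `rename_stats` is a hypothetical helper and the values are made up.

```python
# Old counter name -> proposed counter name, as listed above.
RENAMES = {
    "rule-refused": "replies-rule-refused",
    "rule-nxdomain": "replies-rule-nxdomain",
    "servfail-responses": "backend-responses-servfail",
}

def rename_stats(snapshot):
    """Return a copy of a stats snapshot using the proposed names.

    Counters that are not being renamed pass through unchanged.
    """
    return {RENAMES.get(name, name): value for name, value in snapshot.items()}

# Made-up sample values, for illustration only.
old = {"rule-refused": 7, "servfail-responses": 2, "cache-hits": 40}
new = rename_stats(old)
```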

Additionally, it would seem to follow that these statements would be true (please correct me if this is not the case!):

The existing statistic responses would equal the sum of all backend-responses-<rcode> entries.
The existing statistic cache-hits would equal the sum of all replies-packetcache-<rcode> entries.
The total number of sent responses to clients would be the sum of all replies-<rcode> entries.
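These relationships could be checked mechanically against a statistics snapshot. The Python sketch below assumes the snapshot is a plain dict and uses made-up numbers; `sum_family` and `check_invariants` are hypothetical helpers, and only a subset of rcodes is listed. Note that the sums are built from explicit counter names, because a simple prefix match on "replies-" would also pick up the replies-packetcache-<rcode> and replies-rule-<rcode> counters.

```python
# Subset of rcode labels, for illustration only.
RCODES = ("noerror", "servfail", "nxdomain", "refused")

def sum_family(stats, prefix, rcodes=RCODES):
    """Sum the counters named prefix + <rcode> for every known rcode."""
    return sum(stats.get(prefix + rc, 0) for rc in rcodes)

def check_invariants(stats):
    """Report whether the first two statements above hold for a snapshot."""
    return {
        "responses": stats["responses"] == sum_family(stats, "backend-responses-"),
        "cache-hits": stats["cache-hits"] == sum_family(stats, "replies-packetcache-"),
    }

# Made-up sample values: 5 backend answers, 3 cache hits, 1 rule answer.
sample = {
    "responses": 5,
    "cache-hits": 3,
    "backend-responses-noerror": 4,
    "backend-responses-servfail": 1,
    "replies-packetcache-noerror": 2,
    "replies-packetcache-servfail": 1,
    "replies-rule-refused": 1,
    "replies-noerror": 6,
    "replies-servfail": 2,
    "replies-refused": 1,
}

result = check_invariants(sample)
total_replies = sum_family(sample, "replies-")  # total sent to clients
```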