Open johnhtodd opened 6 years ago
Also the docs should document all of that.
More thinking on this: should the counters towards clients be expanded and named so that the bound address/port/protocol is included in the results? Examples: "replies-0.0.0.0-53-UDP-noerror" or "replies-packetcache-2001:db8::a3-53-TCP-noerror". Most sites have at least v4 and v6, and being able to distinguish between them would seem useful. Our site has many IP addresses associated with a single dnsdist instance, and getting good response statistics on a per address/port/protocol basis seems like a reasonable extension to this idea if there is going to be an overhaul of the code.
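A minimal sketch of how such a per-listener counter name could be assembled, written in Python purely for illustration. The function `counter_name` and its parameters are hypothetical and not part of dnsdist; the format simply follows the two example names above.

```python
def counter_name(address, port, protocol, rcode, source=None):
    """Build a name such as 'replies-0.0.0.0-53-UDP-noerror'.

    `source` is an optional origin label such as 'packetcache'
    (a hypothetical parameter, used only for this illustration).
    """
    parts = ["replies"]
    if source:
        parts.append(source)
    # Protocol is upper-cased and rcode lower-cased to match the examples.
    parts += [address, str(port), protocol.upper(), rcode.lower()]
    return "-".join(parts)
```

For example, `counter_name("2001:db8::a3", 53, "tcp", "NoError", source="packetcache")` yields the second example name given above.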
Queries(/responses) that are dropped because of a ResponseAction are partially accounted:
Queries that hit a pool with no (up) servers will increase "no-policy", but there is no separate statistics bucket for muted UDP clients.
Some notes (for myself mostly):
pushed some initial code to https://github.com/zeha/pdns/tree/dnsdist-stats
zeha - Who else do you think would be interested in this? I'd like to get some more input on whether this is a useful fix. Does rgacogne's recent patch in https://github.com/PowerDNS/pdns/pull/6563 add a previously unmonitored dimension to this (drops)?
Unsure who else would be interested (at least publicly).
Drops in #6563 are per downstream server; if we do backend-responses-<server>..., then we should look at that too.
Note that the drop rate is the number of reused per second over the last period, so this can easily be derived from reused even before #6563, if needed.
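A rough sketch of that derivation in Python, assuming `reused` is a monotonically increasing counter that is sampled twice, one period apart. The function name `drop_rate` and its parameters are made up for this illustration and are not dnsdist APIs.

```python
def drop_rate(reused_before, reused_after, period_seconds):
    """Drops per second over one sampling period, from two readings of 'reused'."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    # Difference between the two counter readings, divided by the elapsed time.
    return (reused_after - reused_before) / period_seconds
```

A counter reset between the two samples (for example after a restart) would produce a negative value, so a real collector would probably want to discard that sample.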
Program: dnsdist
Issue type: Feature request
Short description
For dnsdist, we believe that we’re not counting all of the replies that we’re actually sending back to clients. A quick examination of the code by zeha appears to support that: in particular, “servfail” replies generated from the packetcache do not seem to be stored in any obvious counter. The current list of dnsdist statistics (https://dnsdist.org/statistics.html#responses) may be incomplete, or at least it is not obvious how to use it to ascertain the true response rates of the various rcodes sent to clients. Additionally, we require a more detailed breakdown of which reply rcodes are actually being sent to clients, and where those replies originate within dnsdist’s processing. This would allow us to understand the behavior of dnsdist at large scale, and across differing client and network conditions.
Description
Here is the list of current (1.2) items that seem to be relevant when counting replies or client-side data:
cache-hits Number of times an answer was retrieved from cache.
cache-misses Number of times an answer was not found in the cache.
responses Number of responses received from backends.
rule-nxdomain Number of NXDomain answers returned because of a rule.
rule-refused Number of Refused answers returned because of a rule.
self-answered Number of self-answered responses.
servfail-responses Number of servfail answers received from backends.
What was discussed on IRC was possibly breaking up the various rcode types into more specific statistics counters that track replies by rcode type and by where the response originated, for the purpose of better understanding dnsdist performance.
New naming and rcode statistics proposal:
replies-<rcode> Number of responses sent to clients of this rcode type (total)
replies-packetcache-<rcode> Number of responses sent to clients of this rcode type that were from packetcache
replies-rule-<rcode> Number of responses sent to clients of this rcode type that were generated by dnsdist internal rules
backend-responses-<rcode> Number of responses received from backend servers of this rcode type
The \<rcode> value would be taken from the IANA list and would be the “human-readable” name in lower case such as “noerror” or “refused”. (https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml) Currently, values in the range of 0-15 are interesting, but having the extended rcodes would also be useful as they become adopted. Replies with rcodes not matching the existing IANA-recognized list could be ignored, or put into a single bucket, or their numeric values could be used as the rcode summary (this is left unspecified - someone else with other requirements may need this, but not us.)
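As an illustration of the naming rule, here is a small Python sketch that maps a numeric rcode to its lower-case IANA name and shows the three options for unrecognized values. The table covers only rcodes 0 to 10; the function `rcode_label`, the `unknown` parameter, and the bucket name "other" are assumptions made for this example.

```python
# Lower-cased names from the IANA DNS RCODE registry (values 0-10 only).
RCODE_NAMES = {
    0: "noerror", 1: "formerr", 2: "servfail", 3: "nxdomain",
    4: "notimp", 5: "refused", 6: "yxdomain", 7: "yxrrset",
    8: "nxrrset", 9: "notauth", 10: "notzone",
}

def rcode_label(rcode, unknown="numeric"):
    """Return the counter suffix for an rcode.

    unknown="numeric" uses the number itself, "bucket" groups everything
    under a single (hypothetical) "other" label, "ignore" returns None.
    """
    name = RCODE_NAMES.get(rcode)
    if name is not None:
        return name
    if unknown == "numeric":
        return str(rcode)
    if unknown == "bucket":
        return "other"
    return None  # caller skips counting this reply
```

With this, a reply counter name could be formed as `"replies-" + rcode_label(2)`, giving `replies-servfail`.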
It may be the case that even further breakdown of responses from backends might be useful, for example creating “backend-responses-\<backend-name>-\<rcode>” so that each backend system could be evaluated separately, and measured for performance. This is a non-trivial expansion of the original concept of simply collecting rcode statistics to clients, and perhaps may be a separate ticket.
By creating these new buckets, several existing statistics would probably be renamed so that they would match the new structure:
rule-refused -> replies-rule-refused
rule-nxdomain -> replies-rule-nxdomain
servfail-responses -> backend-responses-servfail
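If the renames went ahead, existing dashboards or collectors might need a translation step during the transition. A minimal Python sketch, assuming the statistics are available as a plain name-to-value dictionary (the `rename_stats` helper is hypothetical):

```python
# Old name -> proposed new name, exactly as listed above.
RENAMES = {
    "rule-refused": "replies-rule-refused",
    "rule-nxdomain": "replies-rule-nxdomain",
    "servfail-responses": "backend-responses-servfail",
}

def rename_stats(stats):
    """Return a copy of `stats` with old counter names replaced by the proposed ones."""
    return {RENAMES.get(name, name): value for name, value in stats.items()}
```

Counters that are not in the rename table pass through unchanged.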
Additionally, it would seem to follow that these statements would be true (please correct me if this is not the case!)
responses would equal the sum of all backend-responses-<rcode> entries.
cache-hits would equal the sum of all replies-packetcache-<rcode> entries.
replies-<rcode> entries.
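The first two equalities could be checked against a statistics snapshot with something like the Python sketch below. It assumes the proposed counter names exist and that the snapshot is a plain dictionary; `sum_with_prefix` and `check_totals` are hypothetical helpers, and the equalities themselves are the unconfirmed ones stated above.

```python
def sum_with_prefix(stats, prefix):
    """Sum every counter whose name starts with `prefix`."""
    return sum(value for name, value in stats.items() if name.startswith(prefix))

def check_totals(stats):
    """Report whether each proposed equality holds for this snapshot."""
    return {
        "responses": stats.get("responses", 0)
        == sum_with_prefix(stats, "backend-responses-"),
        "cache-hits": stats.get("cache-hits", 0)
        == sum_with_prefix(stats, "replies-packetcache-"),
    }
```

If per-backend counters of the form backend-responses-\<backend-name>-\<rcode> were added as well, the simple prefix match here would double count and would need to be narrowed.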