To be clear, we're talking about `answers`, `auth4-answers` and `auth6-answers` as far as I am concerned.
In the "simple" version, the buckets would become
0-1, 1-10, 10-50, 50-100, 100-1000, slow
What buckets would you like, ideally? If we are going to change this, we might as well do it right.
Sadly, the SNMP MIB also has to be adapted.
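For illustration, a minimal sketch (Python; the function and constant names are made up for this example) of how a single answer latency would map onto the proposed buckets, using the edges and labels from the list above rather than anything in the recursor source:

```python
# Hypothetical sketch of the "simple" bucket proposal above; the edges and
# labels come from this comment, not from the actual recursor code.
PROPOSED_EDGES_MS = [1, 10, 50, 100, 1000]           # upper bounds, in milliseconds
PROPOSED_LABELS = ["0-1", "1-10", "10-50", "50-100", "100-1000", "slow"]

def bucket_for(latency_ms: float) -> str:
    """Return the label of the bucket a single answer latency would fall into."""
    for edge, label in zip(PROPOSED_EDGES_MS, PROPOSED_LABELS):
        if latency_ms < edge:
            return label
    return PROPOSED_LABELS[-1]                        # anything >= 1000 ms counts as "slow"

# e.g. bucket_for(0.4) == "0-1", bucket_for(72) == "50-100", bucket_for(2500) == "slow"
```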
On timer buckets: I like equivalent features, if possible, so I'll offer an example that is slightly better than what PDNS uses today. Unbound uses a doubling metric starting at 0.001ms, which at the bottom end looks a bit unnecessary and is almost always empty until one reaches the ~0.1ms mark, and at the top end might be described as too large (though I have no real complaints with their values; it's just a mild distaste for such large results). I've enclosed a snapshot of a Grafana example from Unbound counters on a single one of our more "remote" Unbound instances to show what the distribution looks like in reality.

This is a question with no good answer. If I had more than a few seconds in the next week to look at this, I'd actually look at an hour or two's worth of timers to see what response times look like, build a curve of those response times with the number of responses as the Y value, and then divide the curve into perhaps 16 different "buckets", all holding the same number of responses; a sketch of that idea follows below. This would only reflect "my" view of the world, of course, and others may have different results. It may be useful to round up or down to whole numbers or multiples of 10 to make things more human-friendly.

Having the higher number of buckets has actually been quite useful: we see large authoritative operators change routing or become slower/faster by observing these graphs over time, along with many other secondary indicators of trouble which are obvious only with thinner divisions of charting.
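A rough sketch of that equal-population approach, assuming you already have an hour or two of per-response latencies from query logging or a capture (the data source, the 16-bucket count, and the function name are all illustrative; none of this is something the recursor exposes today):

```python
# Derive ~16 bucket edges from observed response times so that each bucket
# holds roughly the same number of responses. Illustration only; "latencies_ms"
# would come from whatever logging or packet capture you have available.
import numpy as np

def equal_population_edges(latencies_ms, n_buckets=16):
    """Return bucket upper edges so each bucket holds ~1/n_buckets of the samples."""
    quantiles = np.linspace(0, 1, n_buckets + 1)[1:-1]   # interior cut points
    edges = np.quantile(latencies_ms, quantiles)
    # Round to "human-friendly" values as suggested above (nearest whole ms here).
    return sorted(set(int(round(e)) for e in edges))

# Example with synthetic data standing in for an hour of timers:
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)  # made-up latency distribution
print(equal_population_edges(sample))
```

On real data the resulting edges would presumably be rounded further by hand before being baked into fixed bucket names.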
Implemented by #10122
Short description
dnsdist has a latency statistic bucket of 50-100ms, while powerdns-recursor does not. Having this additional bucket would be useful, as the sub-100ms range of latency is very important for understanding improvements or behavior changes in recursor-to-authoritative responses. (Actually, having many more statistics buckets in the 1-100ms span would be useful, given that most of the internet is less than 100ms wide when looking at major authoritative anycast distributions, but I'll be happy just with a 50-100ms bucket being created. After all, the point of these statistics is to allow administrators to solve problems like improving latency, and if improvements can't be seen because the bucket size is too coarse, that makes life difficult for everyone.)
Usecase
Implicitly described in description. This would be consumed by time-series monitors.
Description
Any/all metrics that output statistical summaries of latency would need to reflect the new bucket. This would make the old 10-100ms bucket invalid from a naming perspective, so it would be a breaking change. (If we're doing a breaking change anyway, we might as well add more buckets so this is a one-time thing, but I'll get off my soapbox now.)
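If the rename does happen, one way to soften the break for existing dashboards would be to re-aggregate the finer buckets back into the old coarse one in whatever collector sits in front of the time-series database. A hedged sketch, assuming the split buckets follow the existing `answers0-1` / `answers1-10` naming pattern (the split names used here are guesses, not confirmed recursor output):

```python
# Hypothetical shim for a metrics collector: if the old 10-100ms counter is
# gone, synthesize it from the finer buckets so legacy graphs keep working.
# The "answers10-50" / "answers50-100" names are assumptions for illustration.
def backfill_coarse_bucket(metrics: dict) -> dict:
    """Given per-bucket counters, synthesize the legacy 10-100ms counter."""
    out = dict(metrics)
    if "answers10-100" not in out:
        out["answers10-100"] = out.get("answers10-50", 0) + out.get("answers50-100", 0)
    return out

# e.g. backfill_coarse_bucket({"answers10-50": 120, "answers50-100": 30})
#      -> {"answers10-50": 120, "answers50-100": 30, "answers10-100": 150}
```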