To be clear, we're talking about `answers`, `auth4-answers` and `auth6-answers` as far as I am concerned.
In the "simple" version, the buckets would become
0-1, 1-10, 10-50, 50-100, 100-1000, slow
What buckets would you like, ideally? If we are going to change this, we might as well do it right.
Sadly, the SNMP MIB also has to be adapted.
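For illustration, a minimal sketch (Python; the function and constant names are made up for this example) of how a single answer latency would map onto the proposed buckets, using the edges and labels from the list above rather than anything in the recursor source:

```python
# Hypothetical sketch of the "simple" bucket proposal above; the edges and
# labels come from this comment, not from the actual recursor code.
PROPOSED_EDGES_MS = [1, 10, 50, 100, 1000]           # upper bounds, in milliseconds
PROPOSED_LABELS = ["0-1", "1-10", "10-50", "50-100", "100-1000", "slow"]

def bucket_for(latency_ms: float) -> str:
    """Return the label of the bucket a single answer latency would fall into."""
    for edge, label in zip(PROPOSED_EDGES_MS, PROPOSED_LABELS):
        if latency_ms < edge:
            return label
    return PROPOSED_LABELS[-1]                        # anything >= 1000 ms counts as "slow"

# e.g. bucket_for(0.4) == "0-1", bucket_for(72) == "50-100", bucket_for(2500) == "slow"
```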
On timer buckets: I like equivalent features, if possible, so I'll offer an example that is slightly better than what PDNS uses today. Unbound uses a doubling metric starting at 0.001ms, which at the bottom end looks a bit unnecessary and is almost always empty until one reaches the ~0.1ms mark, and at the top end might be described as too large (though I have no real complaints with their values; it's just a mild distaste for such large results). I've enclosed a snapshot of a Grafana example from Unbound counters on a single one of our more "remote" Unbound instances to show what the distribution looks like in reality.

This is a question with no good answer. If I had more than a few seconds in the next week to look at this, I'd actually look at an hour or two's worth of timers to see what response times look like, build a curve of those response times with the number of responses as the Y value, and then divide the curve into perhaps 16 different "buckets", all holding the same number of responses; a sketch of that idea follows below. This would only reflect "my" view of the world, of course, and others may have different results. It may be useful to round up or down to whole numbers or multiples of 10 to make things more human-friendly.

Having the higher number of buckets has actually been quite useful: we see large authoritative operators change routing or become slower/faster by observing these graphs over time, along with many other secondary indicators of trouble which are obvious only with thinner divisions of charting.
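A rough sketch of that equal-population approach, assuming you already have an hour or two of per-response latencies from query logging or a capture (the data source, the 16-bucket count, and the function name are all illustrative; none of this is something the recursor exposes today):

```python
# Derive ~16 bucket edges from observed response times so that each bucket
# holds roughly the same number of responses. Illustration only; "latencies_ms"
# would come from whatever logging or packet capture you have available.
import numpy as np

def equal_population_edges(latencies_ms, n_buckets=16):
    """Return bucket upper edges so each bucket holds ~1/n_buckets of the samples."""
    quantiles = np.linspace(0, 1, n_buckets + 1)[1:-1]   # interior cut points
    edges = np.quantile(latencies_ms, quantiles)
    # Round to "human-friendly" values as suggested above (nearest whole ms here).
    return sorted(set(int(round(e)) for e in edges))

# Example with synthetic data standing in for an hour of timers:
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)  # made-up latency distribution
print(equal_population_edges(sample))
```

On real data the resulting edges would presumably be rounded further by hand before being baked into fixed bucket names.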
Implemented by #10122
Short description
dnsdist has a latency statistic bucket of 50-100ms, while powerdns-recursor does not. Having this additional bucket would be useful, as the sub-100ms range of latency is very important for understanding improvements or behavior changes in recursor-to-authoritative responses. (Actually, having many more statistics buckets in the 1-100ms span would be useful, given that most of the internet is less than 100ms wide when looking at major authoritative anycast distributions, but I'll be happy just with a 50-100ms bucket being created. After all, the point of these statistics is to allow administrators to solve problems like improving latency, and if improvements can't be seen because the bucket size is too coarse, that makes life difficult for everyone.)
Usecase
Implicitly described in description. This would be consumed by time-series monitors.
Description
Any/all metrics that output statistical summaries of latency would need to reflect the new bucket. This would make the old 10-100ms bucket invalid from a naming perspective, so it would be a breaking change. (If we're doing a breaking change anyway, we might as well add more buckets so this is a one-time thing, but I'll get off my soapbox now.)
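If the rename does happen, one way to soften the break for existing dashboards would be to re-aggregate the finer buckets back into the old coarse one in whatever collector sits in front of the time-series database. A hedged sketch, assuming the split buckets follow the existing `answers0-1` / `answers1-10` naming pattern (the split names used here are guesses, not confirmed recursor output):

```python
# Hypothetical shim for a metrics collector: if the old 10-100ms counter is
# gone, synthesize it from the finer buckets so legacy graphs keep working.
# The "answers10-50" / "answers50-100" names are assumptions for illustration.
def backfill_coarse_bucket(metrics: dict) -> dict:
    """Given per-bucket counters, synthesize the legacy 10-100ms counter."""
    out = dict(metrics)
    if "answers10-100" not in out:
        out["answers10-100"] = out.get("answers10-50", 0) + out.get("answers50-100", 0)
    return out

# e.g. backfill_coarse_bucket({"answers10-50": 120, "answers50-100": 30})
#      -> {"answers10-50": 120, "answers50-100": 30, "answers10-100": 150}
```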