Closed sopel closed 11 years ago
Percentiles are notably absent in the UI, so here's how my notion of this being available in Librato might have come to be:
Accordingly, here's a later feature request around 'Percentile source aggregation' - the answer is slightly promising, but nothing to hold your breath for:
We are planning to improve built-in support for percentiles in the future. In the interim your best option is to submit the percentile as its own metric (some of our libraries already do this) and then use our new summary statistic support: http://blog.librato.com/posts/2013/6/3/fine-grained-access-to-summary-statistics
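For illustration, a minimal sketch of the "submit the percentile as its own metric" workaround quoted above: compute the percentiles client-side for each flush interval and report each one as a separate gauge. The `flush` helper, the `.p90`-style metric names and the sample timings are all hypothetical, and no actual Librato API call is shown.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def flush(name, values, pcts=(90, 95, 99)):
    """Turn one interval's raw timings into one gauge per percentile.

    The naming scheme is an assumption made for this sketch only.
    """
    return {f"{name}.p{p}": percentile(values, p) for p in pcts}


# Made-up request timings (ms) collected during one flush interval.
timings_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 900]
gauges = flush("request.latency", timings_ms)
print(gauges)
```

Each resulting key would then be submitted like any other gauge, so the metrics service only ever sees precomputed values.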
The Graphite story around percentile looked slightly better at first, but seems to fall short regardless:
Hosted Graphite seems to offer what one would expect by means of ':90pct :95pct :99pct etc.' aggregations:
Want arbitrary percentile data? Just add the number after the colon followed by ‘pct’. It accepts values from 01 to 99. If you want 100th percentile you should be using ”:max”!
There are disturbingly few posts covering this though, one of which is 'Graphite's derivative function lies' from January 2013, which opens another can of worms (not analyzed/verified as such yet) - Abe Hassan's comment covers percentiles too:
My suspicion is that Graphite's concept of percentiles is related to the data points it has stored. So it's not the 90th percentile at that point, but rather the 90th percentile of the data in the metric. To get 90th percentile at a given point in time, I would use statsd, which can calculate that and emit it to Graphite.
So there's a percentile at a point in time, and then a percentile across all time (or across the last X data points). I suspect Graphite is doing the latter. Technically valid, but super duper confusing.
- :exclamation: Please note that similar to Librato above, once again StatsD would be the 'percentiles bridge' here, i.e. relying on the reporting backend rather than the metrics service to handle this.
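To make the distinction in the quoted comment concrete, here is a small sketch comparing (a) a percentile computed per interval from raw timings, as a StatsD-style reporter would, with (b) a percentile taken over the data points a metrics store already holds. That the store keeps one average per interval is purely an assumption for this example, and the timings are made up.

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    ordered = sorted(values)
    return ordered[math.ceil(pct * len(ordered) / 100) - 1]


# Three flush intervals of raw request timings (ms), made-up data.
intervals = [
    [10] * 8 + [500, 500],
    [20] * 10,
    [30] * 10,
]

# (a) Percentile at a point in time: one p90 per interval, from raw timings.
per_interval_p90 = [nearest_rank(raw, 90) for raw in intervals]

# (b) Percentile across stored data points: keep one average per interval
#     (an assumption for this sketch), then take the p90 of that series.
stored = [sum(raw) / len(raw) for raw in intervals]
across_series_p90 = nearest_rank(stored, 90)

print(per_interval_p90)   # [500, 20, 30]
print(across_series_p90)  # 108.0
```

Both numbers are legitimately "a 90th percentile", but only (a) says anything about the slowest requests within a given interval.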
:information_source: It's worth noting that Coda Hale's excellent Metrics library specifically mentions and addresses the subtleties with percentiles/quantiles in low-latency services, see Histograms:
Traditionally, the way the median (or any other quantile) is calculated is to take the entire data set, sort it, and take the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services.
The solution for this is to sample the data as it goes through. By maintaining a small, manageable reservoir which is statistically representative of the data stream as a whole, we can quickly and easily calculate quantiles which are valid approximations of the actual quantiles. This technique is called reservoir sampling.
Metrics provides a number of different Reservoir implementations, each of which is useful.
[...]
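The reservoir sampling idea from the quote can be sketched roughly as follows. This is a plain uniform reservoir (Vitter's Algorithm R) written from scratch for illustration; it is not the Metrics library's code, and the class name, the reservoir size and the seed are arbitrary choices.

```python
import math
import random


class UniformReservoir:
    """Keeps a fixed-size uniform random sample of an unbounded stream."""

    def __init__(self, size=1000, seed=None):
        self.size = size
        self.count = 0          # number of values seen so far
        self.samples = []
        self._rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.samples) < self.size:
            # Still filling up: keep everything.
            self.samples.append(value)
        else:
            # Replace a random slot with probability size / count.
            j = self._rng.randrange(self.count)
            if j < self.size:
                self.samples[j] = value

    def quantile(self, q):
        """Approximate quantile (0 < q <= 1) from the current sample."""
        if not self.samples:
            raise ValueError("reservoir is empty")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(q * len(ordered)))  # nearest rank, 1-based
        return ordered[rank - 1]


# Usage: stream 100,000 values through a 1,000-slot reservoir.
reservoir = UniformReservoir(size=1000, seed=42)
for value in range(100_000):
    reservoir.update(value)

print(reservoir.quantile(0.5), reservoir.quantile(0.99))
```

Memory stays bounded by the reservoir size regardless of throughput; the price is that the reported quantiles are approximations whose accuracy depends on that size.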
- :question: @dpb587 - With Elasticsearch being written mostly in Java, this library might be used already or could be integrated via a plugin eventually (haven't looked into it yet)?
None yet other than the need to look into our options around percentile support more thoroughly soon, with a workaround eventually being reporting backend aggregation via StatsD or Metrics - let's talk about it in the upcoming hangout.
@dpb587 - the elasticsearch-metrics plugin seems to indicate that Metrics isn't used/available in Elasticsearch by default, but only via this plugin as a starting point; going down this route would mean more work though:
:information_source: It's also worth noting that Metrics supports reporting to JMX too, which in turn would allow surfacing selected stock or custom JMX metrics in other tools (New Relic does support this for example, see Custom JMX monitoring by YAML).
@sopel, thanks for this excellent bit of research. A couple of things stand out for me:
Closed as Incomplete: the analysis is considered sufficient, but a solution along the lines of percentile calculation at the metrics source, as concluded above, is neither feasible nor a priority right now.
:information_source: @mrdavidlaing - regarding #111 it's worth noting that Riemann seems to support percentiles (see instrumentation.clj) and thus might allow addressing this at the event sink rather than the source - it currently lacks docs, so I'm not sure about the subtleties mentioned above, but given the pronounced background and attention to detail of Riemann's author Kyle Kingsbury (see e.g. Timelike: a network simulator) I'd expect this to be a solid attempt at least.
@mrdavidlaing just discovered that we've failed to specify percentile support as a metrics reporting/visualization requirement:
I recall some notion of percentiles in Librato, but given the lack of presence in the UI, there must be a flaw, am looking into it right now.