Analyze/Integrate metrics percentiles

sopel commented 11 years ago

@mrdavidlaing just discovered that we failed to specify percentile support as a metrics reporting/visualization requirement:

One thing I'm concerned with is that I can only see how to plot averages, mins & maxes, but not percentiles (like the 98th percentile) or distributions. For latency measuring this is a bit of a deal breaker, since latency distributions are NOT normally distributed - [...]
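To make the concern concrete, here is a minimal Python sketch. The `percentile` helper (nearest-rank method) and the latency values are illustrative assumptions, not taken from any of the tools discussed in this issue:

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in ms: most are fast, a few are very slow.
latencies_ms = [10] * 95 + [1000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p98_ms = percentile(latencies_ms, 98)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p98={p98_ms} ms")
```

With this made-up data the mean is 59.5 ms, a latency no request actually had, while the median is 10 ms and the 98th percentile is 1000 ms. Min and max alone would not reveal how many requests sit in the slow tail.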

I recall some notion of percentiles in Librato, but given their absence from the UI, there must be a catch; I'm looking into it right now.

sopel commented 11 years ago

Initial Analysis

Librato

Percentiles are notably absent in the UI, so here's how my notion of this being available in Librato might have come to be:

There's a percentile-related post about Native StatsD integration with Gauges and Percentiles
- :exclamation: Please note that this functionality is tied to the StatsD reporting backend rather than built into the service!
Accordingly, here's a later 'feature request' around 'Percentile source aggregation' - the answer is slightly promising, but nothing to hold your breath for:

We are planning to improve built-in support for percentiles in the future. In the interim your best option is to submit the percentile as its own metric (some of our libraries already do this) and then use our new summary statistic support: http://blog.librato.com/posts/2013/6/3/fine-grained-access-to-summary-statistics

Graphite

The Graphite story around percentiles looked slightly better at first, but seems to fall short regardless:

Graphite has a few dedicated functions for working with percentiles, see e.g. nPercentile and percentileOfSeries
Hosted Graphite seems to offer what one would expect accordingly by means of ':90pct :95pct :99pct etc.' aggregations:

Want arbitrary percentile data? Just add the number after the colon followed by ‘pct’. It accepts values from 01 to 99. If you want 100th percentile you should be using “:max”!
There are disturbingly few posts covering this, though. One of them is 'graphite's derivative function lies' from January 2013, which opens another can of worms (not analyzed/verified as such yet) - Abe Hassan's comment there covers percentiles too:
My suspicion is that Graphite's concept of percentiles is related to the data points it has stored. So it's not the 90th percentile at that point, but rather the 90th percentile of the data in the metric. To get 90th percentile at a given point in time, I would use statsd, which can calculate that and emit it to Graphite.

So there's a percentile at a point in time, and then a percentile across all time (or across the last X data points). I suspect Graphite is doing the latter. Technically valid, but super duper confusing.
- :exclamation: Please note that similar to Librato above, once again StatsD would be the 'percentiles bridge' here, i.e. relying on the reporting backend rather than the metrics service to handle this.
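
The distinction drawn in the quoted comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `percentile` helper, the flush intervals, and the timing values are hypothetical, and the code does not call StatsD or Graphite. It compares a percentile computed from the raw timings of each flush interval (the StatsD-style approach) with a percentile computed afterwards over the data points that were stored per interval (here, per-interval means):

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical raw timings (ms) collected during three flush intervals.
raw_by_interval = [
    [10] * 8 + [500, 500],
    [20] * 10,
    [10] * 8 + [300, 900],
]

# Percentile at a point in time: computed from the raw events of each
# interval before anything is stored.
per_interval_p90 = [percentile(raw, 90) for raw in raw_by_interval]

# What a plain averaged series would store: one mean per interval.
stored_means = [sum(raw) / len(raw) for raw in raw_by_interval]

# Percentile across time: computed later from the stored data points only.
series_p90 = percentile(stored_means, 90)

print("per-interval p90:", per_interval_p90)
print("stored means:", stored_means)
print("p90 of stored series:", series_p90)
```

With these numbers the per-interval values are 500, 20 and 300 ms, whereas the percentile over the stored series is 128.0 ms, a figure that describes the stored averages rather than the latency any request experienced.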

Subtleties/Alternatives

:information_source: It's worth noting that Coda Hale's excellent Metrics library specifically mentions and addresses the subtleties with percentiles/quantiles in low-latency services, see Histograms:

Traditionally, the way the median (or any other quantile) is calculated is to take the entire data set, sort it, and take the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services.

The solution for this is to sample the data as it goes through. By maintaining a small, manageable reservoir which is statistically representative of the data stream as a whole, we can quickly and easily calculate quantiles which are valid approximations of the actual quantiles. This technique is called reservoir sampling.

Metrics provides a number of different Reservoir implementations, each of which is useful.

[...]
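
For illustration, the reservoir sampling idea described in the excerpt can be sketched in Python as follows. This is a generic uniform reservoir (the classic "Algorithm R"), not the code of the Metrics library; the class name `ReservoirSketch`, its methods, and the simulated stream are assumptions made for this example:

```python
import random


class ReservoirSketch:
    """Keeps a fixed-size uniform random sample of an unbounded stream."""

    def __init__(self, size, seed=None):
        self.size = size
        self.count = 0          # total number of values seen so far
        self.samples = []       # the reservoir itself
        self._rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.samples) < self.size:
            # Reservoir not full yet: keep every value.
            self.samples.append(value)
        else:
            # Keep the new value with probability size / count by
            # overwriting a uniformly chosen slot.
            j = self._rng.randrange(self.count)
            if j < self.size:
                self.samples[j] = value

    def quantile(self, pct):
        """Nearest-rank percentile (integer pct in 1..100) of the sample."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
        return ordered[rank - 1]


# Simulated stream: 10,000 values, of which only 100 are retained.
reservoir = ReservoirSketch(size=100, seed=42)
for value in range(10_000):
    reservoir.update(value)

print("seen:", reservoir.count, "kept:", len(reservoir.samples))
print("approximate p99:", reservoir.quantile(99))
```

The memory cost stays constant no matter how many values pass through, at the price of the quantiles being approximations whose accuracy depends on the reservoir size.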

:question: @dpb587 - With Elasticsearch being written mostly in Java, this library might be used already or could be integrated via a plugin eventually (haven't looked into it yet)?

Conclusion

None yet, other than that we need to look into our options for percentile support more thoroughly soon; an eventual workaround would be aggregation in the reporting backend via StatsD or Metrics - let's talk about it in the upcoming hangout.

sopel commented 11 years ago

@dpb587 - the elasticsearch-metrics plugin seems to indicate that Metrics isn't used/available in Elasticsearch by default, but could be added with this plugin as a starting point; going down this route would mean more work though:

seems to lack percentiles as well and currently works with Graphite only, but claims that 'Gathering additional metrics (or reporting to other destinations, like Ganglia) is easy', which is indeed usually the case with the abstractions/libraries at hand
creating dynamic metrics is not implemented yet, see collecting statistics for all indices
- :question: @mrdavidlaing - while obviously desirable given the way Elasticsearch operates, it may not be required for the use case at hand?

sopel commented 11 years ago

:information_source: It's also worth noting that Metrics supports reporting to JMX too, which would in turn allow surfacing selected stock or custom JMX metrics in other tools (New Relic does support this for example, see Custom JMX monitoring by YAML).

mrdavidlaing commented 11 years ago

@sopel, thanks for this excellent bit of research. A couple of things stand out for me:

Percentiles need to be calculated at the "shipping" source, rather than at the "graphing" dashboard.
New Relic can plot custom metrics (e.g. JMX), so we might be able to use it for all our LogSearch cluster monitoring needs.
LogStash+StatsD -> Kibana with some graph extensions might be a viable Graphite competitor.

sopel commented 11 years ago

Closed as Incomplete: the analysis is considered sufficient, but a solution along the lines of calculating percentiles at the metrics source, as concluded above, is not feasible or not a priority right now.

sopel commented 11 years ago

:information_source: @mrdavidlaing - regarding #111 it's worth noting that Riemann seems to support percentiles (see instrumentation.clj), and thus might allow addressing this at the event sink rather than the source - it lacks docs currently, so I'm not sure about the subtleties mentioned above, but given the strong background and attention to detail of Riemann's author Kyle Kingsbury (see e.g. Timelike: a network simulator) I'd expect this to be a solid attempt at least.
