cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

Analyze/Integrate metrics percentiles #101

Closed sopel closed 11 years ago

sopel commented 11 years ago

@mrdavidlaing just discovered that we failed to specify percentile support as a metrics reporting/visualization requirement:

One thing I'm concerned with is that I can only see how to plot averages, mins & maxes, but not percentiles (like 98% percentile) or distributions. For latency measuring this is a bit of a deal breaker; since latency distributions are NOT normally distributed - [...]

I recall some notion of percentiles in Librato, but given their absence from the UI, there must be a catch; I am looking into it right now.
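
To illustrate the concern quoted above, here is a minimal sketch (not from the thread) of how far the mean can sit from the tail when latencies are skewed. The log-normal distribution, its parameters, and the `percentile` helper are assumptions chosen for illustration only; real latency data will differ.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms, drawn from a log-normal distribution
# (an assumption standing in for a typical long-tailed latency profile).
rng = random.Random(42)
latencies = [rng.lognormvariate(3.0, 1.0) for _ in range(10_000)]

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p98 = percentile(latencies, 98)

# The mean lands between the median and the tail and describes neither well.
print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p98={p98:.1f} ms")
```

With a long tail like this, the mean overstates the typical request and understates the slow ones, which is why a plot of averages alone can hide latency problems.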

sopel commented 11 years ago

Initial Analysis

Librato

Percentiles are notably absent in the UI, so here's how my notion of this being available in Librato might have come to be:

Graphite

The Graphite story around percentile looked slightly better at first, but seems to fall short regardless:

Subtleties/Alternatives

:information_source: It's worth noting that Coda Hale's excellent Metrics library specifically mentions and addresses the subtleties with percentiles/quantiles in low-latency services, see Histograms:

Traditionally, the way the median (or any other quantile) is calculated is to take the entire data set, sort it, and take the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services.

The solution for this is to sample the data as it goes through. By maintaining a small, manageable reservoir which is statistically representative of the data stream as a whole, we can quickly and easily calculate quantiles which are valid approximations of the actual quantiles. This technique is called reservoir sampling.

Metrics provides a number of different Reservoir implementations, each of which is useful.

[...]

  • :question: @dpb587 - With Elasticsearch being written mostly in Java, this library might be used already or could be integrated via a plugin eventually (haven't looked into it yet)?
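
The reservoir sampling idea quoted above can be sketched as follows. This is a plain Python illustration of uniform reservoir sampling (commonly known as Algorithm R), not the Metrics library's code; the class name `SketchReservoir`, its methods, and the reservoir size of 1000 are all hypothetical.

```python
import random


class SketchReservoir:
    """Keeps a fixed-size uniform random sample of an unbounded stream."""

    def __init__(self, size=1000, rng=None):
        self.size = size
        self.count = 0          # total number of values seen so far
        self.values = []        # the current sample
        self.rng = rng or random.Random()

    def update(self, value):
        self.count += 1
        if len(self.values) < self.size:
            # Fill phase: keep everything until the reservoir is full.
            self.values.append(value)
        else:
            # Replace a random slot with probability size / count,
            # which keeps every value seen so far equally likely to be retained.
            slot = self.rng.randrange(self.count)
            if slot < self.size:
                self.values[slot] = value

    def quantile(self, q):
        """Approximate quantile (0 <= q <= 1) computed from the sample only."""
        if not self.values:
            raise ValueError("no values recorded")
        ordered = sorted(self.values)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[index]


reservoir = SketchReservoir(size=1000, rng=random.Random(7))
for latency in range(100_000):      # stand-in for a stream of measurements
    reservoir.update(latency)

print(reservoir.count, len(reservoir.values), reservoir.quantile(0.5))
```

Memory stays bounded at the reservoir size regardless of stream length, and only the small sample is sorted when a quantile is requested. The result is an approximation whose error shrinks as the reservoir grows.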

Conclusion

None yet, other than the need to look into our options for percentile support more thoroughly soon. An eventual workaround would be to aggregate percentiles before reporting, via StatsD or Metrics. Let's talk about it in the upcoming hangout.

sopel commented 11 years ago

@dpb587 - the elasticsearch-metrics plugin seems to indicate that Metrics isn't used or available in Elasticsearch by default, but the plugin could serve as a starting point; going down this route would mean more work for us, though:

sopel commented 11 years ago

:information_source: It's also worth noting that Metrics supports reporting to JMX, which would in turn allow surfacing selected stock or custom JMX metrics in other tools (New Relic does support this, for example; see Custom JMX monitoring by YAML).

mrdavidlaing commented 11 years ago

@sopel, thanks for this excellent bit of research. A couple of things stand out for me:

  1. Percentiles need to be calculated at "shipping" source, rather than at "graphing" dashboard
  2. New Relic can plot custom metrics (e.g. JMX), so we might be able to use it for all our LogSearch cluster monitoring needs.
  3. LogStash+StatsD -> Kibana with some graph extensions might be a viable Graphite competitor.
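
One way to see why point 1 holds: once only a summary (min, mean, max) has been shipped, the percentile can no longer be recovered downstream. The sketch below uses made-up sample sets and a hypothetical `percentile` helper; it is an illustration, not a description of how any of the tools discussed here work internally.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summary(values):
    """What a dashboard typically receives: (min, mean, max)."""
    return (min(values), sum(values) / len(values), max(values))


# Two hypothetical sets of 100 latencies (ms) with identical summaries.
mostly_steady = [10] + [100] * 98 + [1000]
heavy_tail = [10] * 46 + [30] * 45 + [1000] * 9

print(summary(mostly_steady), percentile(mostly_steady, 98))
print(summary(heavy_tail), percentile(heavy_tail, 98))
```

Both sets report the same min, mean and max, yet their 98th percentiles differ by a factor of ten. That information only exists where the raw samples are available, i.e. at the source.
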

sopel commented 11 years ago

Closed as Incomplete: the analysis is considered sufficient, but a solution along the lines of calculating percentiles at the metrics source, as concluded above, is either not feasible or not a priority right now.

sopel commented 11 years ago

:information_source: @mrdavidlaing - regarding #111, it's worth noting that Riemann seems to support percentiles (see instrumentation.clj) and thus might allow us to address this at the event sink rather than the source. It currently lacks docs, so I'm not sure about the subtleties mentioned above, but given the strong background and attention to detail of Riemann's author Kyle Kingsbury (see e.g. Timelike: a network simulator), I'd expect this to be a solid attempt at least.