elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add a new quantile histogram aggregation for numeric fields #50386

Open agirbal opened 4 years ago

agirbal commented 4 years ago

This issue is related to https://github.com/elastic/elasticsearch/issues/31828 and to some extent https://github.com/elastic/elasticsearch/pull/28993. It would be more useful for most of our use cases than https://github.com/elastic/elasticsearch/issues/31828 (cc @pcsanwald). I discussed this feature briefly with @smayzak and @AlonaNadler.

Problem: when building a histogram over a numeric value (on the X-axis), it is very common for the distribution of documents to be concentrated in a tiny portion of the histogram. A common example: if you plot against, say, "user request latency" of a production system, 90+% of the documents are going to be concentrated in the first bucket. This is a long-tail problem that is common to most production datasets. Trying to filter out the higher values is very tedious, and you still end up with a histogram that is not conducive to any analysis or conclusions.

Ideal solution: most data analysis (that we base decisions on) instead uses a quantile distribution on the X-axis, meaning that each bucket represents an equal portion of the data. For example, the first bucket would be the 10% of users with the best "request latency" (call it p0-10), the next would be the 10-20% best (p10-20), and so on, and the last bucket would be the 10% of users with the worst performance (p90-100). In turn this lets the operator do very clear analysis: "this change in my software is hurting performance by 5% for my 10% best connected users but improves it by 15% for my p90 users, so it's a very positive change." Each bucket could either be equal in terms of portion of the dataset or, better, you could customize the ranges as percentile ranks, just like you do in the percentiles value function.
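To make the difference concrete, here is a minimal Python sketch (not Elasticsearch code) that buckets the same long-tailed sample two ways: with fixed-width buckets and with quantile buckets. The function names and the synthetic `latencies` list are made up for illustration, and the cut points come from Python's `statistics.quantiles`, which is only a stand-in for whatever percentile estimate ES would use.

```python
import bisect
import statistics


def fixed_width_counts(values, num_buckets):
    """Count values per bucket when the X-axis is split into equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid dividing by zero if all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value lands exactly on the upper edge, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def quantile_counts(values, num_buckets):
    """Count values per bucket when the bucket edges are quantile cut points."""
    # n=num_buckets yields num_buckets - 1 cut points between the buckets.
    cuts = statistics.quantiles(values, n=num_buckets, method="inclusive")
    counts = [0] * num_buckets
    for v in values:
        counts[bisect.bisect_left(cuts, v)] += 1
    return cuts, counts


# Synthetic long-tailed "request latency" sample: 90 fast requests, 10 very slow ones.
latencies = [10 + i for i in range(90)] + [1000 * (i + 1) for i in range(10)]

print(fixed_width_counts(latencies, 10))   # almost everything falls in the first bucket
print(quantile_counts(latencies, 10)[1])   # every bucket holds an equal share
```

With fixed-width buckets the first bucket swallows nearly the whole sample, which is the problem described above; with quantile cut points each bucket holds the same share of documents, so a Y-axis metric computed per bucket compares like with like.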

Workaround: as suggested by @jpountz, you can do a pre-flight request to ES to obtain the quantile bucket bounds, then make a second request for a standard histogram with known buckets. I have done this and it works, but it is extremely cumbersome and not really a viable solution beyond being a fun experiment. I had to create a complex HTML form to pick the fields, percentiles, function to apply to the Y-axis, etc., then hack together a complex URL query string to generate the Kibana histogram, which is guaranteed to break. From there, the display in Kibana is not really shareable: you can't change the time window or any filter without redoing the whole thing, because the buckets need to be recalculated.
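For reference, the two-request workaround can be sketched roughly as below. This only builds the request bodies and does not talk to a cluster. It assumes the pre-flight request uses the `percentiles` aggregation and that the "known buckets" request uses a `range` aggregation; the helper names, the aggregation names (`bounds`, `quantile_buckets`), the field name `latency_ms` and the sample response are all hypothetical.

```python
def percentile_bounds_request(field, percents):
    """Body of the pre-flight request: ask for the percentile values of `field`."""
    return {
        "size": 0,
        "aggs": {"bounds": {"percentiles": {"field": field, "percents": list(percents)}}},
    }


def bounds_from_response(response):
    """Extract the percentile values, ordered by percent, from a keyed response."""
    values = response["aggregations"]["bounds"]["values"]
    return [values[key] for key in sorted(values, key=float)]


def range_request_from_bounds(field, bounds):
    """Body of the second request: one range bucket per interval between bounds."""
    edges = sorted(set(bounds))  # duplicate bounds would produce empty ranges
    ranges = [{"to": edges[0]}]
    ranges += [{"from": a, "to": b} for a, b in zip(edges, edges[1:])]
    ranges.append({"from": edges[-1]})
    return {
        "size": 0,
        "aggs": {"quantile_buckets": {"range": {"field": field, "ranges": ranges}}},
    }


# Hand-written stand-in for the first response (illustrative numbers only).
sample_response = {
    "aggregations": {"bounds": {"values": {"25.0": 12.0, "50.0": 20.0, "75.0": 95.0}}}
}

first_body = percentile_bounds_request("latency_ms", [25, 50, 75])
second_body = range_request_from_bounds("latency_ms", bounds_from_response(sample_response))
print(second_body["aggs"]["quantile_buckets"]["range"]["ranges"])
```

The second body is only valid for the exact query, filters and time window used in the first request, which is why any change in Kibana forces both requests to be redone.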

Note there are already Kibana tickets about this: https://github.com/elastic/kibana/issues/3905 and https://github.com/elastic/kibana/issues/3757. But for this to work seamlessly in Kibana, it really seems that ES should support it as a native aggregation. Thanks much!

elasticmachine commented 4 years ago

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

talevy commented 4 years ago

It would be great to do this in two passes, on sorted data. This is blocked on multi-pass aggregation support.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)