elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.97k stars 24.75k forks source link

Enhancement: Range agg specified as max bucket count rather than explicit ranges #24254

Closed mrec closed 3 years ago

mrec commented 7 years ago

NOTE: This is very similar to #9572 about histogram, and is probably similarly blocked by #12316, but I didn't want to hijack that one given that it's a different aggregation. I also couldn't see any mention of the "swamping" problem described below as the second motivation.

A range aggregation is specified using an array of explicit bucket ranges:

"range" : {
    "field" : "price",
    "ranges" : [
        { "to" : 50 },
        { "from" : 50, "to" : 100 },
        { "from" : 100 }
    ]
}

I'm proposing an alternative which just requests a maximum number of buckets to return; syntax is open to bikeshedding but could be e.g.

"range" : {
    "field" : "price",
    "ranges" : 3
}

The motivation for this is twofold:

  1. One very common use case for range aggs is creating drilldown UI to augment a primary text-based search. The end-user doesn't enter the range boundaries; rather, the client application will do a calibration request with something like a percentiles agg, aiming for N buckets with doc counts as even as possible, and set the range bucket boundaries for the main request based on the results of that. The problem here is duplicate work; it means two round-trips to Elastic rather than one, both trips will need to apply the request's filtering criteria, and the dependency on those criteria means that the result of the percentiles agg can't usefully be cached. Doing the percentiles internally on the server would avoid the network overhead of the second call, and could hopefully avoid repeating the document filtering step.
  2. Occasionally the matching docs are swamped by one value, meaning that (for example) requesting [25,50,75] percentiles can return the same value for all three making it almost useless for drilldown. We can work around this when using the Java client by binary-chopping the returned Percentiles until you find something useful, but this isn't possible via the REST API and won't be possible via the Java client either (see #23610) once it switches to REST transport.
markharwood commented 7 years ago

As you suggest this is likely blocked because we would need multiple phases to select the appropriate bucketing strategy for use across multiple indices/shards.

Putting aside the challenges of implementing this efficiently, I imagine there would need to be some extra client criteria to control the aesthetics of derived buckets. Given a website that sells spoons, cars and houses they would need flexible bucketing depending on the individual queries i.e. tiers of rounding (prices snap to either units of 1, 10, 1,000, 10,000, 100,000). Often such sites implement "category snapping" to determine not just appropriate ranges of prices but other category-appropriate facets such as "transmission types" apply for cars, "number of bedrooms" apply for houses etc. In these situations it makes more sense to define a whole set of categories and the appropriate aggregations for them and implement a query-pre-processing system to determine which is the most probable category for a given query.

colings86 commented 6 years ago

This will be solved by https://github.com/elastic/elasticsearch/pull/28993 when its merged so closing in favour of https://github.com/elastic/elasticsearch/issues/9572 which is the issue for that change

mrec commented 6 years ago

Hi @colings86 - it's great to see that work happening, but from the PR description I'm not sure it addresses this request. Your change seems to be specifically about histograms, and is constrained to keep all intervals the same width. This ticket was about range for faceted navigation, where it's sometimes more valuable for all intervals to contain roughly the same doc count and you don't necessarily care about them being the same width. See also @markharwood's example of "prices snap to either units of 1, 10, 1,000, 10,000, 100,000" .

If I'm misunderstanding, apologies; I've only read the description, not the code diffs.

colings86 commented 6 years ago

@mrec sorry, you are right, your use case is slightly different so I'll reopen this issue.

Off the back of #28993 I was actually thinking about using a similar approach of merging buckets at collection time to create an aggregation for constant-density histograms where we would aim to create buckets that have roughly the same number of documents in them. I am yet to convince myself whether this is just a case of being able to run k-means clustering on the numeric values though so it might be solved by https://github.com/elastic/elasticsearch/issues/5512

talevy commented 3 years ago

Assuming that it is sufficient to have these N ranges be variable and do not need to be uniform ranges, then this comment suggests the Variable Width Histogram is a great alternative.

feel free to re-open if that is not the case, and the ranges must be uniform.

polyfractal commented 3 years ago

Related (constant width auto histo): https://github.com/elastic/elasticsearch/issues/31828