elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.52k stars 24.61k forks source link

Terms agg: calculate aggs on 'other' bucket #12411

Closed j0hnsmith closed 7 months ago

j0hnsmith commented 9 years ago

The terms aggregation now provides an 'other' bucket with a count, I'd like to see the same aggregations performed on the 'other' bucket. Eg if I'm doing a stats aggregation I have stats (sum) for docs with term foo and bar but not for docs where the field is missing or has a null value.

This is really important for analytics type services as all the values must add up to 100% of the total.

There's quite a bit of discussion about it in #5324.

clintongormley commented 9 years ago

@j0hnsmith calculating sub-aggs on an other bucket requires two round trips. The first calculates the top-ten terms (plus their sub-aggs). The second calculates the sub-aggs on everything except the top-ten terms.

To support this in Elasticsearch, we'd need to implement #12316 first. However this is something you can do yourself today.

j0hnsmith commented 9 years ago

I know there are workarounds, but with every level of sub aggregation they get progressively more complex, this could simplify some very complex queries.

vivekmoosani commented 8 years ago

+1

PaulGrandperrin commented 8 years ago

+1

powermik commented 8 years ago

+1

dynomeat commented 8 years ago

+1

EdwardKaravakis commented 8 years ago

+1

markharwood commented 6 years ago

cc @elastic/es-search-aggs

timroes commented 5 years ago

I just wanted to check on in this issue and ping the current team @elastic/es-analytics-geo (since team names seem to have been changed).

We currently use that workaround described above in Kibana to calculate the "Other bucket" for terms aggregations and it's causing us a lot of pain. For one thing, it's the only thing we need to do a second request to gather all information needed for a visualization to render, thus needing some special handling in our infrastructure. Also since that "Other bucket" itself is not really a bucket in the terms of ES, we need a lot of special handling for those, e.g. the filter creation logic need to handle them individually. We also see a lot of issues where (our code) doesn't work properly when having more complex aggregation configurations for your visualization. And last but not least, since they are not real buckets we can never make them properly work with Bucket Script or Bucket Selector, which we want to implement in the future, but then need some special handling or disable them for Other again.

Having the "Other bucket" feature available in Elasticsearch (most importantly for Terms, but we've also seen user asking for it on the Filters Agg and Significant Terms aggs, and I assume in the future also on Rare Terms agg), would be one of the really huge wins for Kibana visualizations and its infrastructure. If there is anything we can support you with, please let me know.

polyfractal commented 5 years ago

Not overly familiar with the issue, but is there a reason that the missing aggregation and/or missing parameter on terms agg aren't sufficient?

E.g. the missing aggregation can go next to the terms aggregation and give you all the documents that don't have the particular field which is being aggregated in the terms agg, and allow you to perform sub-aggs there.

Alternatively, you can set the missing param on a terms agg to something unique (__$MISSING$__ or whatever), and then use that bucket for sub-aggregations.

timroes commented 5 years ago

I think there is some confusing between missing and "Other documents" here :-) Missing is a bucket containing all the documents not having a value in that specific field and instead they will get that missing value. Other bucket should contain all documents not returned in as a bucket otherwise. These can totally fine have values for that field, but just not be under the top x terms requested. So if you request the top 5 terms, and a document is not in one of those buckets it should be in the other bucket. It could still be that this document has a value for that field, just not one of the top 5 common ones.

I can also explain shortly why that was such a highly requested feature in Kibana for a long time, that we decided to implement in (in the above described not very stable way) on our side: Pie charts :-)

If you want to draw the top 5 countries in a pie chart, without the other value, they would always make up 100% of the chart. That can be very confusing depending on what you're trying to visualize, because those countries are not making up at all 100% of the data, but maybe just 30%. Users want to see that, by having 70% drawn as others and these top 5 values as individual slices. Similar things apply for other chart types as well.

Just FYI: we also expose the Missing setting and use as you described above internally a unique identifier for that bucket, so we can find it later, but this is a completely different functionality:

screenshot-20190702-095745

polyfractal commented 5 years ago

Gotcha, makes sense :+1: . Thanks for the extra details @timroes. Mark added the team-discuss label to this, so we'll chat about it in the next meeting.

polyfractal commented 5 years ago

Chatted in analytics meeting, and unfortunately we're at the same roadblock as four years ago. We can't calculate an "other" bucket before we know the global top-n results... and at that point it's too late to build an "other" bucket because we are reducing on the coordinating node. To do this we need two-pass/multi-pass support in the aggregation framework which doesn't exist today (although we have been talking about how we could potentially do it). First pass to find top-n, second pass to collect everything else that wasn't in the top-n into a bucket.

How is Kibana performing the two-passes today? What does the structure of the second query look like (you can point me at code too, that's fine)? Trying to see if there is something we can do to help make the second pass easier until the agg framework has true multi-pass capability. Is it a filter agg for must_not: <top-n terms> or similar?

As a side note, when/if Jim's API (https://github.com/elastic/elasticsearch/issues/26472) is implemented, external two-pass implementations will at least have a consistent view of the index which will make the situation a little better.

timroes commented 5 years ago

Yes Kibana will basically build a must_not: <top-n terms> from the first request.

Here is an example query where I enabled Other Bucket for two nested term queries with one sum metric:

Actual query ```json { "aggs": { "2": { "terms": { "field": "user", "order": { "1": "desc" }, "size": 5 }, "aggs": { "1": { "sum": { "field": "number" } }, "3": { "terms": { "field": "state", "order": { "1": "desc" }, "size": 1 }, "aggs": { "1": { "sum": { "field": "number" } } } } } } }, "size": 0, "_source": { "excludes": [] }, "stored_fields": [ "*" ], "script_fields": { "is_bug": { "script": { "source": "return doc['labels'].contains('bug') ? 1 : 0", "lang": "painless" } } }, "docvalue_fields": [ { "field": "closed_at.time", "format": "date_time" }, { "field": "created_at.time", "format": "date_time" }, { "field": "last_crawled_at", "format": "date_time" }, { "field": "updated_at.time", "format": "date_time" } ], "query": { "bool": { "must": [], "filter": [ { "match_all": {} } ], "should": [], "must_not": [] } } } ```
Level 1 (user) other bucket query ```json { "aggs": { "other-filter": { "aggs": { "1": { "sum": { "field": "number" } }, "3": { "terms": { "field": "state", "order": { "1": "desc" }, "size": 1 }, "aggs": { "1": { "sum": { "field": "number" } } } } }, "filters": { "filters": { "": { "bool": { "must": [ { "exists": { "field": "user" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "user": { "query": "spalger" } } }, { "match_phrase": { "user": { "query": "nreese" } } }, { "match_phrase": { "user": { "query": "cjcenizal" } } }, { "match_phrase": { "user": { "query": "kibanamachine" } } }, { "match_phrase": { "user": { "query": "stacey-gammon" } } } ] } } } } } }, "size": 0, "_source": { "excludes": [] }, "stored_fields": [ "*" ], "script_fields": { "is_bug": { "script": { "source": "return doc['labels'].contains('bug') ? 1 : 0", "lang": "painless" } } }, "docvalue_fields": [ { "field": "closed_at.time", "format": "date_time" }, { "field": "created_at.time", "format": "date_time" }, { "field": "last_crawled_at", "format": "date_time" }, { "field": "updated_at.time", "format": "date_time" } ], "query": { "bool": { "must": [], "filter": [ { "match_all": {} } ], "should": [], "must_not": [] } } } ```
Level 2 (State) other bucket query ```json { "aggs": { "other-filter": { "aggs": { "1": { "sum": { "field": "number" } } }, "filters": { "filters": { "-spalger": { "bool": { "must": [ { "match_phrase": { "user": { "query": "spalger" } } }, { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "state": { "query": "closed" } } } ] } }, "-nreese": { "bool": { "must": [ { "match_phrase": { "user": { "query": "nreese" } } }, { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "state": { "query": "closed" } } } ] } }, "-cjcenizal": { "bool": { "must": [ { "match_phrase": { "user": { "query": "cjcenizal" } } }, { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "state": { "query": "closed" } } } ] } }, "-kibanamachine": { "bool": { "must": [ { "match_phrase": { "user": { "query": "kibanamachine" } } }, { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "state": { "query": "closed" } } } ] } }, "-stacey-gammon": { "bool": { "must": [ { "match_phrase": { "user": { "query": "stacey-gammon" } } }, { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "match_phrase": { "state": { "query": "closed" } } } ] } }, "-__other__": { "bool": { "must": [ { "exists": { "field": "state" } } ], "filter": [], "should": [], "must_not": [ { "bool": { "should": [ { "match_phrase": { "user": "spalger" } }, { "match_phrase": { "user": "nreese" } }, { "match_phrase": { "user": "cjcenizal" } }, { "match_phrase": { "user": "kibanamachine" } }, { "match_phrase": { "user": "stacey-gammon" } } ], "minimum_should_match": 1 } }, { "match_phrase": { "state": { "query": "closed" } } } ] } } } } } }, "size": 0, "_source": { "excludes": [] }, "stored_fields": [ "*" ], "script_fields": { "is_bug": { "script": { "source": "return doc['labels'].contains('bug') ? 1 : 0", "lang": "painless" } } }, "docvalue_fields": [ { "field": "closed_at.time", "format": "date_time" }, { "field": "created_at.time", "format": "date_time" }, { "field": "last_crawled_at", "format": "date_time" }, { "field": "updated_at.time", "format": "date_time" } ], "query": { "bool": { "must": [], "filter": [ { "match_all": {} } ], "should": [], "must_not": [] } } } ```

In general you can see those requests when doing a terms aggregation for a visualization in Kibana, enable Other Buckets on that, and then use the Inspect button on top and switch from the tabular data view to the request view. It will show all requests done.

In general the implementation itself works that every aggregation can execute a post flight request, and the post flight request for terms can be found in the terms.js file with most of the actual merging and filtering logic happening in _terms_other_bucket_helper.js. If you inspect that code you'll also get a good feeling why I would prefer that logic to live inside Elasticsearch :-)

wchaparro commented 7 months ago

closing as not planned.