elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

MaxDocCount for term aggregation #74751

Open vladimirdjuricic opened 3 years ago

vladimirdjuricic commented 3 years ago

This has already been requested a couple of times (in GitHub issues) and discussed on the forums. I had the need to implement this on my own.

Related:

This feature is required in cases where I want to apply a range to the buckets being retrieved. The bucket selector aggregation cannot be used here when there are millions of buckets, due to the fact that all buckets are retrieved before the selector is applied.

Facts:

Help is needed to discover a way to make the results respect the "requiredSize" ("size") option. Imagine there are buckets with counts from 200000 all the way down to zero. Currently, when I set the aggregation as below, buckets with a count greater than 10000 are taken into account, but I get no results. To get results I need to set "size" to some greater number (e.g. 15000), but then the results will be within the range. I guess that some decrement (or a filter on increment) should be applied to the total buckets found.

If someone could point me in the right direction, I could most probably implement this faster. In any case I will continue with the experiment. I am satisfied with the result so far. Like I said, the bucket selector cannot stand too many buckets, but this approach looks fine (except for that issue with "size").

{ "terms": { "field": "applicant.applicant_url", "size": 1000, "max_doc_count": 10000, "min_doc_count": 1000 } }
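As a rough illustration of the intended behaviour, here is a minimal Python sketch (bucket data invented) of what the range filter would do to the bucket list a terms aggregation returns; it is a client-side approximation, not the server-side implementation from the PR:

```python
# Sketch, not Elasticsearch code: apply min/max doc-count bounds and a
# size cap to terms-aggregation buckets. All bucket data is invented.

def filter_buckets(buckets, min_doc_count, max_doc_count, size):
    """Keep at most `size` buckets whose doc_count lies in [min, max]."""
    in_range = [b for b in buckets
                if min_doc_count <= b["doc_count"] <= max_doc_count]
    return in_range[:size]

# Buckets as Elasticsearch returns them: sorted by doc_count, descending.
buckets = [
    {"key": "a.example", "doc_count": 200_000},
    {"key": "b.example", "doc_count": 9_500},
    {"key": "c.example", "doc_count": 1_200},
    {"key": "d.example", "doc_count": 800},
]

print(filter_buckets(buckets, 1000, 10000, 1000))
# a.example is above max_doc_count, d.example below min_doc_count
```

The "size" issue from the description shows up here too: the cap is applied after filtering, so the number of surviving buckets, not the number of candidate buckets, is what "size" should count.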

Issued PR: https://github.com/elastic/elasticsearch/pull/74752

elasticmachine commented 3 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

wchaparro commented 3 years ago

Hey @vladimirdjuricic, thanks for this. Can you provide us with some use cases that we can use to make a business justification for these changes? Thank you!

vladimirdjuricic commented 3 years ago

Hi @wchaparro, thanks for your reply. Hope this helps. Let me know should you need more details.

Environment:

Requirement:

As a user, I would like to get the first n "company_unique_name" values (size) where the document count (from a terms aggregation on company_unique_name) is between A (min_doc_count) and B (max_doc_count).

Scenario 1: Give me up to 1000 companies with a document count between 20000 and 30000

Scenario 2: Give me up to 100 companies with a document count less than or equal to 10000
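For illustration, the two scenarios might be expressed like this with the proposed "max_doc_count" option. This is a sketch: the option comes from the linked PR and is not part of released Elasticsearch, and the surrounding request shape is an assumption.

```python
# Hypothetical aggregation bodies for the two scenarios, written as
# Python dicts mirroring the Elasticsearch JSON DSL. "max_doc_count"
# is the proposed option from PR #74752, not an existing parameter.

scenario_1 = {
    "aggs": {
        "companies": {
            "terms": {
                "field": "company_unique_name",
                "size": 1000,
                "min_doc_count": 20000,
                "max_doc_count": 30000,   # proposed option
            }
        }
    }
}

scenario_2 = {
    "aggs": {
        "companies": {
            "terms": {
                "field": "company_unique_name",
                "size": 100,
                "min_doc_count": 1,       # default lower bound
                "max_doc_count": 10000,   # "less than or equal to 10000"
            }
        }
    }
}
```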


Thanks again!

iverase commented 3 years ago

This is a hard problem to solve in a distributed system. Let's imagine you have two shards: in one shard company A has 1 million docs, while in the other shard the same company has 25 thousand docs.

In your implementation, when searching for companies having between 20 and 30 thousand docs, the bucket for company A is discarded in the first shard but added to the result from the second shard. I believe the final result will therefore contain company A, which is not the right result.
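This failure mode can be sketched in a few lines (shard counts invented; the naive merge stands in for a per-shard "max_doc_count" filter applied before the coordinating node combines results):

```python
# Simulate two shards filtering buckets locally by a doc-count range
# [20_000, 30_000] and a coordinating node naively summing whatever
# each shard returns. All counts are invented for illustration.

MIN_DC, MAX_DC = 20_000, 30_000

shard_1 = {"A": 1_000_000, "B": 22_000}
shard_2 = {"A": 25_000, "B": 3_000}

def per_shard_filter(counts):
    return {k: v for k, v in counts.items() if MIN_DC <= v <= MAX_DC}

merged = {}
for shard in (shard_1, shard_2):
    for key, count in per_shard_filter(shard).items():
        merged[key] = merged.get(key, 0) + count

print(merged)
# Company A (true total 1_025_000) slips into the result via shard 2,
# and company B's merged count misses shard 2's 3_000 docs.
```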

The rare_terms aggregation tries to solve this issue by using a cuckoo filter that holds the terms that have been discarded. Still, in order to avoid memory problems, the maximum doc count is limited to 100.

In summary:

  1. To implement this functionality, you should follow something like the rare_terms aggregation instead of the terms aggregation.
  2. I think in many cases you are going to need to keep most of the terms in memory, so I am not sure how feasible it is.

vladimirdjuricic commented 3 years ago

Hi, thanks for your replies. @iverase @wchaparro

Example: count less and equal than 10000

  1. Maybe I'm wrong, but the rare_terms aggregation handles rare terms. The terms I am handling are not rare at all.
  2. Company A has 5000 docs in shard 1 and 11000 in shard 2. Question: why would we accept the final result if the results (the sum) from both shards do not comply with the limits that are set (lte 10000)? If any shard returns more than the limit, the key should be marked as undesirable and skipped; and if the sum across shards is greater than the limit, flag that one and move on too.
  3. Single-shard feature only: let's imagine we are using a single shard. This feature is useful with the changes that are already committed (a few tweaks needed, though). I would like to have it at least for a single-shard setup. In the future, a solution for the distributed setup would be useful too, but for a single shard it is a good starting point, IMO.
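Point 2 can be sketched as a merge rule (hypothetical helper, invented shard counts; note the caveat that it assumes every shard reports its local count for every key, which is exactly what the terms aggregation tries to avoid):

```python
# Sketch of the flag-and-skip merge rule from point 2, for an upper
# bound only (lte 10_000). A key is marked undesirable as soon as any
# single shard exceeds the limit, or the cross-shard sum does.

LIMIT = 10_000

def merge_with_flagging(per_shard_counts):
    """per_shard_counts: list of {key: local_doc_count} dicts."""
    totals, flagged = {}, set()
    for shard in per_shard_counts:
        for key, count in shard.items():
            if count > LIMIT:
                flagged.add(key)        # one shard alone is over the limit
                continue
            totals[key] = totals.get(key, 0) + count
            if totals[key] > LIMIT:
                flagged.add(key)        # the cross-shard sum is over the limit
    return {k: v for k, v in totals.items() if k not in flagged}

# Company A: 5_000 + 11_000 docs; company B: 2_000 + 4_000 docs.
shards = [{"A": 5_000, "B": 2_000}, {"A": 11_000, "B": 4_000}]
print(merge_with_flagging(shards))     # A is flagged on shard 2
```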

For point 2, I would just need to find a way to resolve the issue from the original description: "To get results I need to set "size" to some greater number (e.g. 15000), but then the results will be within the range. I guess that some decrement (or a filter on increment) should be applied to the total buckets found."

Thanks