Open vladimirdjuricic opened 3 years ago
Pinging @elastic/es-analytics-geo (Team:Analytics)
Hey @vladimirdjuricic, thanks for this. Can you provide us with some use cases that we can use to make a business justification for these changes? Thank you!
Hi @wchaparro, thanks for your reply. Hope this helps. Let me know should you need more details.
As a user, I would like to get the first n "company_unique_name" values (size) where the document count (from a terms aggregation on "company_unique_name") is between A (min_doc_count) and B (max_doc_count).
Thanks again!
This is a hard problem to solve in a distributed system. Imagine you have two shards: in one shard, company A has 1 million docs, while in the other it has 25 thousand docs.
In your implementation, when searching for companies having between 20 and 30 thousand docs, the bucket for company A in the first shard is discarded, but it is added to the result for the second shard. I believe the final result will therefore contain company A, which is not the right result.
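The shard-merge problem described above can be sketched in a few lines of Python (this is an illustration, not Elasticsearch code; the shard names and counts are made up for the example):

```python
# Why filtering by max_doc_count per shard gives wrong merged results.
# Hypothetical shard-local counts for two companies, with a target
# range of 20,000..30,000 docs.
shard_counts = {
    "shard_0": {"company_a": 1_000_000, "company_b": 22_000},
    "shard_1": {"company_a": 25_000,    "company_b": 3_000},
}

MIN_DOC_COUNT, MAX_DOC_COUNT = 20_000, 30_000

def filter_per_shard(counts):
    """Wrong: drop out-of-range buckets on each shard, then merge.
    company A sneaks in via shard_1, where its local count is 25,000."""
    merged = {}
    for shard in counts.values():
        for term, n in shard.items():
            if MIN_DOC_COUNT <= n <= MAX_DOC_COUNT:
                merged[term] = merged.get(term, 0) + n
    return merged

def filter_after_merge(counts):
    """Correct: sum counts across all shards first, then filter."""
    totals = {}
    for shard in counts.values():
        for term, n in shard.items():
            totals[term] = totals.get(term, 0) + n
    return {t: n for t, n in totals.items()
            if MIN_DOC_COUNT <= n <= MAX_DOC_COUNT}

print(filter_per_shard(shard_counts))
# → {'company_b': 22000, 'company_a': 25000}  (company_a is wrong)
print(filter_after_merge(shard_counts))
# → {'company_b': 25000}  (company_a totals 1,025,000 and is excluded)
```

Note that the per-shard version is also wrong for company B: its shard_1 count of 3,000 is dropped locally, so the merged count (22,000) undercounts the true total (25,000).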
The rare_terms aggregation tries to solve this issue by using a cuckoo filter that holds the terms that have been discarded. Still, in order to avoid memory problems, the maximum doc count is limited to 100.
In summary:
1) To implement this functionality, you should follow something like the rare_terms aggregation instead of the terms aggregation,
2) I think in many cases you are going to need to keep most of the terms in memory, so I am not sure how feasible it is.
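The rare_terms idea mentioned above can be sketched as follows (a simplification for illustration: a plain Python set stands in for the approximate cuckoo filter that the real aggregation uses to bound memory, and the stream/threshold values are made up):

```python
# Sketch of the rare_terms approach: count terms up to a small
# max_doc_count and move anything exceeding it into a "discarded"
# filter so it is never re-admitted.
MAX_DOC_COUNT = 3  # the real aggregation caps this at 100

def rare_terms(doc_stream, max_doc_count=MAX_DOC_COUNT):
    counts = {}
    discarded = set()  # a cuckoo filter in the real implementation
    for term in doc_stream:
        if term in discarded:
            continue  # already known to be too frequent
        counts[term] = counts.get(term, 0) + 1
        if counts[term] > max_doc_count:
            del counts[term]      # too frequent: not "rare"
            discarded.add(term)   # remember so it cannot come back
    return counts

docs = ["a"] * 10 + ["b"] * 2 + ["c"]
print(rare_terms(docs))  # → {'b': 2, 'c': 1}
```

This also makes the memory concern concrete: the discarded filter has to remember every frequent term, which is why the real implementation uses an approximate data structure and keeps max_doc_count small.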
Hi @iverase @wchaparro, thanks for your replies.
Example: count less than or equal to 10000
For point 2, I would just need to find a way to resolve the issue from the original description: "To get the results I need to set the "size" to some greater number (e.g 15000). But, results will be within the range. I guess that there should be some decrement (or filter on increment) applied for total buckets found."
Thanks
This has already been requested a couple of times (in GitHub issues) and discussed on the forums. I had to implement this on my own.
Related:
This feature is required in cases where I want to apply a range filter on the buckets to be retrieved. The Bucket Selector aggregation cannot be used here when there are millions of buckets, due to the fact that all buckets are retrieved before the selector is applied.
Facts:
Help is needed to find a way to make the results respect the "requiredSize" (size) option. Imagine there are buckets with counts ranging from 200000 all the way down to zero. Currently, with the settings below, buckets with a count greater than 10000 are taken into account, but I get no results. To get results I need to set "size" to some greater number (e.g. 15000), but then the results will be within the range. I guess that some decrement (or filter on increment) should be applied to the total buckets found.
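The "size" interaction described here can be sketched in Python (an illustration of the assumed semantics, not actual Elasticsearch internals; the bucket counts are invented to match the description above):

```python
# Why size=1000 yields no results: if the top-`size` buckets are
# selected by doc count *before* the max_doc_count filter runs,
# every survivor can be filtered out.
SIZE, MAX_DOC_COUNT, MIN_DOC_COUNT = 1000, 10_000, 1000

# 10,000 buckets with counts 200000, 199980, ... down to 20.
buckets = [("t%d" % i, c) for i, c in enumerate(range(200_000, 0, -20))]

def filter_after_size(buckets):
    """Take the top `size` buckets first, then apply the range filter.
    Empty result: the top 1000 buckets all exceed max_doc_count."""
    top = sorted(buckets, key=lambda b: -b[1])[:SIZE]
    return [b for b in top if MIN_DOC_COUNT <= b[1] <= MAX_DOC_COUNT]

def size_after_filter(buckets):
    """Apply the range filter first, then take the top `size`."""
    kept = [b for b in buckets if MIN_DOC_COUNT <= b[1] <= MAX_DOC_COUNT]
    return sorted(kept, key=lambda b: -b[1])[:SIZE]

print(len(filter_after_size(buckets)))  # → 0
print(len(size_after_filter(buckets)))  # → 451
```

This matches the workaround in the description: bumping "size" to e.g. 15000 works only because it forces enough buckets past the initial cut for some in-range ones to survive, which is why a decrement (or filtering before the size cut) is needed.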
If someone could point me in the right direction, I could most probably implement this faster. In any case, I will continue with the experiment. I am satisfied with the result. As I said, the bucket selector cannot handle too many buckets, but this approach looks fine (except for that issue with size):
{ "terms": { "field": "applicant.applicant_url", "size": 1000, "max_doc_count": 10000, "min_doc_count": 1000 } }
Issued PR: https://github.com/elastic/elasticsearch/pull/74752