elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Remove support for sorting terms aggregation by ascending count #17614

Closed colings86 closed 3 years ago

colings86 commented 8 years ago

We try to be as flexible as possible when it comes to sorting terms aggregations. However, sorting by anything but by _term or descending _count makes it very hard to return the correct top buckets and counts, which is disappointing to users. For this reason we should remove the ability to sort the terms aggregation by ascending count (split out from https://github.com/elastic/elasticsearch/issues/17588)

rashidkpc commented 8 years ago

If and when you decide if you're going to do this, please open an issue on kibana. We currently support this in the UI so we'll need to issue a deprecation notice well in advance

rashidkpc commented 8 years ago

Oh, also, I'm +1 on this; it's confusing and rarely useful anyway. If we're going to deprecate it pre-5.0.0, which I'd prefer, let me know.

jimczi commented 8 years ago

Since this requires some changes in Kibana I've reverted the removal. Though it is now deprecated in 2.x/2.4.

jccq commented 8 years ago

Frankly, I know people who were using this (actually I'd say relying on it), e.g. to spot anomalies. If the only problem is that it wasn't working very well, could you not have just clarified that in the docs? So many things in Elasticsearch don't really return the right number anyway.

jccq commented 8 years ago

Unless there is an alternative for that use case, of course. Is there? E.g. what are the least frequent MD5s executed across the logs?

djschny commented 8 years ago

For folks that have small datasets this is still useful and does not hurt anything to my knowledge. I do not understand why this is being removed either. I plead that we keep this and enhance the docs, and/or perhaps add a check based upon the size or cardinality of the field.

nich07as commented 8 years ago

I've worked with a few customers who use the ascending count to determine the least popular items and also the occasional outliers. I agree that it might be inaccurate over large datasets, but it would be helpful on smaller samples and doesn't hurt to keep around, as @djschny suggests.

jccq commented 8 years ago

The point is.. what is the replacement really? sure its inaccurate but is there any other way to achieve something like this? i dont think so. Thus the big damage in removing vs... simply explaining the limitation.

jayswan commented 8 years ago

We use this extensively to find outliers in log data. Usually in large data sets you can filter out the most common items before performing the aggregation, so loss of accuracy isn't a big problem.

Removing this feature is a huge problem for logging use cases -- nearly crippling, IMO.

jayswan commented 8 years ago

Furthermore: in exploratory log analysis the exact content of the tails often doesn't matter as much as the general kinds of things in the tails.

As an example, I just used this feature to reduce the set of "interesting" documents in an index of about 6 million logs from 1.5 million to about 100 over the course of 5 minutes by iteratively excluding categories of things I found in the tails.

henrikjohansen commented 8 years ago

@colings86 @jimferenczi ... this is a rather poor decision for a number of use cases where finding rare occurrences is important. It severely impacts security analytics, for example.

colings86 commented 8 years ago

very hard to return the correct top buckets and counts

I want to explain this a bit more as I don't think it's really clear in the description above (apologies for that).

The problem here isn't that the counts can be wrong, the problem is that there is currently no bound on how wrong the counts can be (and no way to know what the error might be). To explain this consider the following example.

Imagine we are looking for the top 5 terms in a field across 3 shards ordered by ascending count. The terms aggregation goes and retrieves the top 5 terms from each shard and then merges them together (in practice it actually retrieves more than the top 5 from each shard, but for the purposes of this example let's assume size and shard_size are the same). The shards might return the following:

Shard 1   Shard 2   Shard 3
a (1)     a (1)     a (1)
b (1)     c (1)     b (1)
d (1)     d (1)     f (1)
g (2)     f (2)     i (1)
h (9)     g (2)     j (2)

When merged on the reduce node this will produce the final list of:

c (1), i (1), b (2), d (2), j (2), a (3), f (3), g (4), h (9)

So the final top 5 will be:

c (1), i (1), b (2), d (2), j (2)

Which seems great until you look into the results from the shards a bit closer.

The counts returned from each shard are 100% accurate, so if a shard says it has 1 document with the term a, it only has one document with the term a. But it's the information that's not returned from the shard that leads to issues. From the shard results above we can see that a is returned from every shard, so we know that the document count for a is completely accurate. But if we now look at d, we can see that it was only returned from shards 1 and 2. We don't know whether Shard 3 doesn't have any documents containing d or whether d just didn't make it into the top 5 terms on Shard 3 (i.e. whether there are 0 or > 2 documents containing d on the shard). It could be that shard 3 happens to have 100 documents containing d but it just wasn't returned in the top N list.

The terms aggregation documentation tries to explain how, with descending count ordering, we can calculate the worst-case error in the doc_count by using the doc_count of the last term returned on each shard. But in the case of ascending order, the doc_count a term could have on a shard that didn't return it could be anything; all we know is that it's either 0 or greater than or equal to the doc_count of the last term returned by each shard.
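The worst case described above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's actual merge code, and the shard contents are hypothetical: shard 3 secretly holds 100 documents containing d, exactly the situation the example warns about.

```python
from collections import Counter

# Hypothetical per-shard term counts. Shard 3 holds 100 docs containing
# "d", but "d" never makes shard 3's top-5 ascending list.
shards = [
    {"a": 1, "b": 1, "d": 1, "g": 2, "h": 9},
    {"a": 1, "c": 1, "d": 1, "f": 2, "g": 2},
    {"a": 1, "b": 1, "f": 1, "i": 1, "j": 2, "d": 100},
]

def shard_top_n_ascending(counts, n):
    # Each shard returns only its n least-frequent terms (ties by term).
    return dict(sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n])

# The coordinating node can only merge what the shards returned.
reported = Counter()
for shard in shards:
    reported.update(shard_top_n_ascending(shard, 5))

# Ground truth, merging the full (unreturned) shard contents.
actual = Counter()
for shard in shards:
    actual.update(shard)

print(reported["d"], actual["d"])  # 2 102
```

The reported count for d is 2 while the true count is 102, and nothing in the shard responses reveals the discrepancy.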

djschny commented 8 years ago

This is still relevant and accurate when you are searching an index with only 1 shard, correct?

clintongormley commented 8 years ago

@djschny we're not building a product that only works on one shard and breaks at scale. This is a really important part of how we make decisions: whatever we build must scale.

nik9000 commented 8 years ago

We don't know whether Shard 3 doesn't have any documents containing d or whether d just didn't make it into the top 5 terms on Shard 3

Maybe rename ascending to ascending_candidates to make it super clear that these might be rare events? I think given this explanation this could still be useful but pretty trap-ish as named.

I think it is interesting that you could use _routing to make this properly accurate again. You'd have to use it when indexing the data and the data has to be amenable to it, but it is a thing.

colings86 commented 8 years ago

Maybe rename ascending to ascending_candidates to make it super clear that these might be rare events? I think given this explanation this could still be useful but pretty trap-ish as named.

Personally I think it's trap-ish regardless of what we name it. If the index has 20 shards and 19 of them have 1000 docs containing a particular term but 1 has only 1 document then the term is likely to be returned as rare even though it is in fact not even slightly rare. That seems very trappy to me whatever we call the option. Consider also in that case you could get into the situation where the term appears in top N list for both doc_count descending and doc_count ascending because with doc_count descending with a doc count of 19000 it could be one of the most popular terms.

On time-based indices this situation could be more likely if a term first appears at the end of an index's time period but is actually very popular. In that case, when searching over a range that spans this period, the term could be returned as rare when in fact it's not rare at all in the context you are looking at.

djschny commented 8 years ago

@djschny we're not building a product that only works on one shard and breaks at scale. This is a really important part of how we make decisions: whatever we build must scale.

I can completely understand that for investing time into new features. But this feature is pre-existing and it does have usefulness. If we followed such a strict line, then we should deprecate and remove the terms aggregation in general and also cardinality, because they have similar behavior to what is discussed here with regard to accuracy and scalability (as defined here). However I assume (or hope) that those are not under consideration for removal as well.

@colings86 thank you for the detailed explanation, however the behavior described is not what is in question. The topic at hand here is that yes, most people are perfectly fine with all the pitfalls of this. This feature is not perfect and that is fine. It still helps solve problems. And we don't have an alternative to it either.

This is not an issue with the potential to affect the health of the cluster (unless I missed that point) like other scalability items (fielddata, large mappings, etc.) do. So I just don't see how this is a scalability issue.

The most practical thing here seems like a combination of what people have mentioned in this thread. For example:

I'm all for removing things that are truly dangerous for the health of the system/cluster, but want to make sure we do not cut too far. IMO this is a cut too far.

I hope I was able to accurately articulate the reasons behind my earlier terse remarks.

jpountz commented 8 years ago

@djschny Even though these aggs are potentially inaccurate, there is a fine line between returning a good estimation and a number that may well be completely off. The motivation here was to remove a trap. Now, there seems to be significant push back, so I'd be fine to reconsider and solve it by documentation, but please please please do not put it into the same bucket as sorting by descending count or the cardinality aggregation; these are very different issues.

mlawler004 commented 7 years ago

Another vote for keeping this functionality available in some shape or form. There are lots of different use cases for ES. We use ES at enterprise scale (millions of records, not billions), and we use this for outlier detection in some circumstances. We're OK with it not being accurate sometimes.

anhlqn commented 7 years ago

+1 for keeping this feature around. We use it to look for the few emails that come from random countries that should not be in the picture. There are also many other use cases in log monitoring and analysis.

epixa commented 7 years ago

@clintongormley @colings86 @jimczi Are there any plans to continue with this for 6.0? It looks stalled at the moment, but I want to make sure we remove the feature from Kibana if it is being removed from Elasticsearch as well.

jimczi commented 7 years ago

@epixa there is no plan and it seems that this functionality is important for some use cases. As @jpountz said we can solve this with documentation, explaining that ascending count sort is not accurate and does not give any hint regarding the level of errors.

epixa commented 7 years ago

@jimczi Thanks for the update. Any reason why we can't close this then?

IdanWo commented 7 years ago

Hey, @colings86. Can you please explain why increasing shard_size isn't good enough for ordering in all kinds of ways? I know that Elasticsearch can't tell you the error bounds in some cases, but if I don't rely on it, and instead run my own tests and know how many documents I need to take from each shard using shard_size, will I then always be 100% accurate? For example, making shard_size equal to size (in which case sum_other_doc_count is always 0). To be concrete, we make terms aggregations ordered by a sub reverse_nested aggregation and then a sub value_count aggregation.

In addition, why doesn't using shard_size scale horizontally? If I need to take a lot of documents from each shard, I can split the index into more shards and assign them to more nodes (at setup time, of course). The shard_size would stay the same, that's true, but each shard would hold fewer documents.

colings86 commented 7 years ago

@IdanWo Increasing shard_size is only good enough if you can guarantee that the shard_size is big enough that all of the terms are returned from each shard. Although this may work for low cardinality fields and/or when the number of shards is relatively small, it does not scale well with number of shards or cardinality of the field. It is true though that in the single shard case and in the case where shard_size > number_terms_buckets_created on every shard the results will be 100% accurate with any ordering.

Although you can indeed split the data across shards, you still need to return number_of_shards * shard_size buckets to the coordinating node for the reduce phase in order to get the final result. This means that even though the work on the shards is decreased by splitting it across more of them, for the same shard_size more shards means that the coordinating node has to hold more in memory (the response from each shard) and do more work during the reduce phase.
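The single-shard / full-coverage case can be sketched the same way: when shard_size is at least the number of distinct terms on every shard, each shard returns everything it has and the merge is exact regardless of sort order. A minimal model with hypothetical shard contents:

```python
from collections import Counter

def shard_response(counts, shard_size):
    # A shard returns its shard_size least-frequent terms.
    return dict(sorted(counts.items(), key=lambda kv: kv[1])[:shard_size])

# Hypothetical shards, each with at most 2 distinct terms.
shards = [{"a": 3, "b": 1}, {"a": 2, "c": 5}, {"b": 4, "c": 1}]

# Ground truth: merge the complete shard contents.
exact = Counter()
for s in shards:
    exact.update(s)

# shard_size covers every distinct term on every shard, so nothing is
# hidden from the coordinating node and the merge matches exactly.
merged = Counter()
for s in shards:
    merged.update(shard_response(s, shard_size=max(len(s) for s in shards)))

assert merged == exact
```

The catch, as noted above, is that guaranteeing this coverage requires knowing the per-shard cardinality up front, and the coordinating node still has to hold number_of_shards * shard_size buckets.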

IdanWo commented 7 years ago

@colings86, thanks for the excellent explanation! I understand the circumstances, BUT I believe that something isn't right with the design decisions made: I don't understand why the terms aggregation is considered a memory-intensive operation whereas cross cluster search, which is potentially much more memory-intensive since it obviously involves multiple indices with multiple shards that return responses to the coordinating node, is considered okay. In cross cluster search the design decision wasn't to limit the request (the request's query or the number of involved shards), but to return batched results to the coordinating node. Why can't we do something similar here? And why doesn't that improvement help solve the current issue with a large shard_size in terms aggregations?

Therefore, it seems that there is motivation to support cross cluster search but low motivation to support full terms aggregations, although technologically they are quite similar in terms of performance concerns. It seems to me that increasing shard_size in a request to a single index is far less dangerous than making a request to an unlimited number of shards at once. How come the default is sometimes unlimited (action.search.shard_count.limit is unlimited) and sometimes there is none and it has to be configured (size: 0 in terms aggregations is deprecated)? Making a terms aggregation across 1,000 shards with size: 300 is worse than a terms aggregation with size: 0 on one index with 10 shards and only 200 unique buckets.

This is taken from the Elasticsearch 5.4.0 released blog post (talking about #23946):

That said, it is quite easy to reach the 1,000 shard limit, especially with the recent release of Cross Cluster Search. As of 5.4.0, Top-N search results and aggregations are reduced in batches of 512, which puts an upper limit on the amount of memory used on the coordinating node, which has allowed us to set the shard soft limit to unlimited by default.

This is taken from the Tribe Nodes & Cross-Cluster Search blog post (pay attention to what is considered a good user experience here):

Now with the addition of cross cluster search where we are emphasizing searches across many, many shards, having such a soft limit isn’t providing a good user experience after all. In order to eliminate the impact of querying a large number of shards, the coordinating node now reduces aggregations in batches of 512 results down to a single aggregation object while it’s waiting for more results to arrive. This upcoming improvement was the initial step to eventually remove the soft limit altogether. The upcoming 5.4.0 release will also allow to reduce the top-N documents in batches. With these two major memory consumers under control, we now default the action.search.shard_count.limit setting to unlimited. This allows users to still limit and protect their searches in the number of shards while providing a good user experience for other users. When you perform a search request, the node which receives the request becomes the coordinating node which is in charge of forwarding the shard-level requests to the appropriate data nodes, collecting the results, and merging them into a single result set. The memory use on the coordinating node varies according to the number of involved shards. Previously, we had added a 1,000 shard soft limit to try to prevent coordinating nodes from using too much memory.

Remark: I can agree that sometimes it's better to stop the use of a bad practice instead of enabling it and supporting its consequences. Personally, I sort terms in descending order via a sub-aggregation, which is also discouraged. But I really have to do it. So I keep using shard_size: 20000, depending on the terms field, which has worked okay until now in our environment (600-800 ms at most for a query, and most of the time the real number of buckets is considerably lower than 20,000, more like 300).

theflakes commented 7 years ago

As others have noted, sorting in ascending order is critical for exploratory data analysis and simple cyber security hunting for LFO events.

I'm probably going to expose my ignorance of ES under the hood, because I am ignorant there. Could a recursive comparison of the aggregate of all shard results help bring more confidence to the returned results? Possibly with sub-queries to specific shards for certain results in question, capped at X number of allowed sub-aggregations? I'm sure there are serious performance considerations involved that my ignorance of ES doesn't let me fully appreciate.

Again, with all that said, I'll re-emphasize the criticality of ascending sort order for many of the data types I deal with in my work. What concerns me here is not just removing the capability but also the confidence level of the currently returned results when querying across multiple shards, if I'm understanding this correctly.

markharwood commented 6 years ago

Since 5.2, Elasticsearch has had support for partitioning in the terms agg.

Those searching for low-frequency events with accuracy should use multiple search requests with an appropriate choice for the number of partitions (the documentation for terms partitioning describes how to do this). Essentially you have to ensure numTermsReturned < the size setting in the terms agg.
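A sketch of one such partitioned request (the field name is hypothetical; a client would loop partition from 0 up to num_partitions - 1, issuing one request per partition so that each request sees only a slice of the terms):

```json
{
  "size": 0,
  "aggs": {
    "rare_md5s": {
      "terms": {
        "field": "md5.keyword",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 10000,
        "order": { "_count": "asc" }
      }
    }
  }
}
```

Because each partition holds roughly 1/20th of the distinct terms, a large enough size can cover every term in the partition, which is the condition for accurate ascending-count results.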

@colings86 given people have a genuine use for this and we have a workaround which maintains accuracy maybe we can keep the reverse-sort feature but error if we determine accuracy is potentially compromised and point to the solution?

cc @elastic/es-search-aggs

polyfractal commented 6 years ago

Hey all. I linked to https://github.com/elastic/elasticsearch/issues/20586 a while ago, but never explicitly commented about it.

We think we've devised an aggregation that will allow aggregating "Rare Terms" in a way that, while not 100% accurate, will provide bounded errors (unlike sorting by count ascending). Our plan is to implement the Rare Terms agg so that there's a path to providing this functionality in a more predictable, bounded manner... and then look into deprecating sorting ascending from Terms agg.

Still no timeline/ETA, but wanted to update everyone about what we were thinking.

jccq commented 6 years ago

uber cool!

epixa commented 6 years ago

cc @elastic/kibana-visualizations

danielo515 commented 5 years ago

By the way, while you decide whether or not to remove it, how are you making queries to get the terms with the lowest doc_count?

arusanescu commented 5 years ago

I don't see any updates here for a while, what is the current status/path going forward with this? Adding @colings86 to get some traction 👍

polyfractal commented 5 years ago

We're still working on RareTerms aggregation outlined in https://github.com/elastic/elasticsearch/issues/20586 (WIP PR here: https://github.com/elastic/elasticsearch/pull/35718), so the plan outlined in https://github.com/elastic/elasticsearch/issues/17614#issuecomment-393884606 above is still valid.

In short, we want to implement RareTerms first, then deprecate sorting by terms agg ascending, then remove at a later date.

arusanescu commented 5 years ago

@polyfractal Awesome, thank you for the insight on this. I did want to confirm whether the statement in the top comment by colings86, "sorting by anything but by _term or descending _count makes it very hard to return the correct top buckets and counts", means that such results are NOT correct in general, or, if there is some level of discrepancy, what exactly we are looking at. If it is known for a fact that this sorting doesn't return correct top buckets/counts, I just want to know where we stand with its usage. While it is deprecated, if the above is true shouldn't there be a stronger message to align with the risks of using this type of sorting?

polyfractal commented 5 years ago

@arusanescu When sorting by count ascending (or sub-agg metrics in many cases), there is a possibility of error. Whether an error creeps into the count or not is dependent on the aggregation (size, shard_size), number of shards and data distribution.

We report the potential worst-case error with a doc_count_error_upper_bound field for the entire aggregation. This returns the highest doc count of a term which potentially didn't make the top-n list, and can be used to help judge how inaccurate the results are. This value can also be enabled on a per-bucket level to get more detailed error reporting.

The idea is that users can determine if they are comfortable with the error being reported, and either accept it or adjust shard_size to get the preferred error (or to not use the particular aggregation at all).

E.g. if the reported error shows a doc_count that could potentially be ranked 95 out of 100 requested results, that might be an error that the user is comfortable with. If the error shows a doc_count that could potentially be ranked 5 out of 100, that's probably unacceptable and they should reconsider.
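The error reporting described above can be enabled per bucket with the show_term_doc_count_error option on the terms agg. A sketch (field name hypothetical):

```json
{
  "aggs": {
    "products": {
      "terms": {
        "field": "product.keyword",
        "size": 5,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

With this set, each returned bucket carries its own doc_count_error_upper_bound alongside doc_count, in addition to the aggregation-level bound, so the worst-case error can be judged term by term.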

While it is deprecated, if above is true shouldn't there be a stronger message to align with the risks of using this type of sorting?

If you look at the terms aggregation documentation, something like 50% of the page is dedicated to explaining the errors, how to interpret them, and warning users off sorting :)

I think we've done everything we can do to document the behavior and warn users off of bad sort orders. :)

arusanescu commented 5 years ago

@polyfractal Thank you for the very detailed explanation! While I had read some of the documents I did also miss some and so this has helped me to better understand the problem and how to deal with it! 👍

nik9000 commented 4 years ago

Now that rare_terms has well and truly landed I think it is time to talk about this again. Maybe deprecating the option in a 7.x release and removing it entirely in 8.0.
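For reference, the landed rare_terms agg covers the long-tail use case roughly like this (field name and threshold are hypothetical; max_doc_count caps how frequent a term may be and still count as rare):

```json
{
  "aggs": {
    "rare_user_agents": {
      "rare_terms": {
        "field": "user_agent.keyword",
        "max_doc_count": 2
      }
    }
  }
}
```

Unlike ordering a terms agg by ascending count, this collects the long tail with a bounded (sketch-based) error rather than a potentially unbounded one.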

nik9000 commented 4 years ago

A bunch of us got together and asked what it'd take to make sorting by ascending count accurate. We weren't sure, but we have some interest in giving it a shot, just not any time soon.

polyfractal commented 3 years ago

We had another discussion :)

As much as we'd like to remove ordering by ascending... we don't think that's a practical reality. Too many people and products rely on the functionality, despite it potentially returning unbounded error (sigh). So I'm going to close this ticket since we don't see ourselves deprecating the functionality.

I'm going to open a subsequent ticket to explore if we can possibly support sort ascending in the terms agg more accurately. RareTerms is still our recommended method if you want fast and reasonably accurate accounting of the long-tail... but we might be able to implement some new kind of execution mode or something for terms agg. It would undoubtedly be slower -- potentially a lot slower -- so it's unclear if we want to introduce such a feature. But that discussion should take place on a different issue.

I'll cross-link the tickets when it is open.