elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.28k stars 24.86k forks source link

Distinct Terms Aggregations #23818

Closed gingerwizard closed 4 years ago

gingerwizard commented 7 years ago

Ticket to request a new terms aggregation capable of identifying distinct values in a field, based on a query restriction. This aggregation would return values for a field in the matching document set, that do not exist in the non matching documents. This can be used to answer questions such as "Give me the values for a field which are new in the last N minutes".

This feature request comes out of use of Watcher to detect new values in fields - a common requirement in security analytics. Problems such as "lateral movement in communications" (user logs onto a new server) or "new process started on a server" currently require multiple query executions. Currently this has led to multiple implementations of watches, with varying degrees of efficiency.

Noting the following approach as discussions with @clintongormley indicate this is the current best approach for detecting new values in a field:

An example watch for detecting a new process can be found here.

This specific watch looks for values in the last N mins for a field, identifying those which are "new".

Given the frequency of this requirement in Watcher, @clintongormley suggested potentially implementing this in ES. The above approach is efficient, provided the background set X is small. In most cases, we believe the set of matching documents would be smaller than those than do not match - thus the above approach is likely the most efficient. Pending discussion as to whether this can be implemented at a node or shard level within ES. https://github.com/elastic/elasticsearch/issues/12316 may enable this to be implemented. Would require careful document that the aggregation is performant provided the matching set is small. We may want to also narrow the requirement to detecting "new" values using time only - with the restriction within the aggregation.

cc @sarwarbhuiyanm @LucaWintergerst @mikeh-elastic @MikePaquette have encountered similar requirements. @skearns64

jpountz commented 7 years ago

It does not mean that nothing should be done, but I can see some similarities with the significant_terms aggregation, since this is looking at things that occur significantly more in a foreground set than in a background set.

skearns64 commented 7 years ago

I've done similar things for low cardinality fields (low-mid hundreds of distinct values). Applying the use-case to data representing "running processes on a server captured from metricbeat," I defined the problem as "alert me when there is a process name that has been seen in the last hour, that wasn't seen in the previous 30 days". It's not a scalable solution, but I used an approach with a terms agg -> date range filter with 2 filters: (now to now-1h and now-1h to now-90d). Then used a simple script in Watcher to check for terms that had hits in the last hour, but not the prior 90 days. This approach requires that you have the terms agg returning all value, which is why this only works for really low-cardinality fields, and isn't a good solution for most cases.

I would have happily used an aggregation like the one described here instead of my hacky approach.

gingerwizard commented 7 years ago

@skearns64 yes the approach you describe works for low cardinality fields. The approach outlined above (two stage query execution) discussed with @clintongormley works better for higher cardinality fields in theory - and is thus more scalable - provided the matching set is low.

nirmalc commented 7 years ago

To find exclusive terms , my approach was to patch significant_terms as @jpountz suggests here. Added a "min_score" and "max_score" in Percentage heuristics . min_score: 1.0 would result in "distinct" terms.

https://github.com/nirmalc/elasticsearch/commit/6e2272503557046f4a89737998a953546c43205c

I can submit a PR if this approach makes sense.

curl -H "Content-type: application/json" -XPOST localhost:9200/_search?pretty=true  -d '{ "query" : { "term" : { "fred": "t32"} }, "aggs" : { "test": { "significant_terms": { "field" : "fred.keyword" , "min_doc_count" : 1, "percentage" : { "min_score" : 1 } }}}}'
nirmalc commented 7 years ago

submitted PR as it may be useful in some cases to have min/max

colings86 commented 6 years ago

@elastic/es-search-aggs

polyfractal commented 6 years ago

This could probably be done (somewhat) scalably, in a single-pass fashion using a set and a bloom filter.

Rough algo:

  1. Query decides if the doc matches the criteria or not.
    a. If matches, add to bloom filter, remove from set if it exists b. If doesn't match, check if it is in bloom filter, if not add to set

And that's it. At the end, the set of terms are the terms that are likely not in the set that matched the query. It gives you the approximate right outer join, but because of the bloom filter semantics it may be missing some terms.

Might be more complicated than that to work with depth_ vs breadth_first, but that's the general idea.

Memory requirements are a bit dicier, being bloom size + size of set. Would be fairly easy to limit however (bloom size is known, hard cap on set size).

polyfractal commented 4 years ago

We're going to close this since there hasn't been much interest over the years. We did have a chat, and it might be possible to implement either A) as I suggested above, or B) by tweaking sig-terms agg to "threshold" background frequencies. I.e. if a term has zero frequency in the background set but positive in the foreground, it would satisfy the request.

We're not sure if it's worth pursuing so closing for now... but happy to re-open if we find more support for such a feature :)