elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support collapsing on runtime fields #66459

Open jonathan-buttner opened 3 years ago

jonathan-buttner commented 3 years ago

Summary

The Security Threat Hunting team has a use case for collapsing on runtime fields. In this issue I'll try to describe this use case and why using a terms aggregation will not provide a solution.

Throughout this issue I'll reference fields like process.entity_id (keyword) and process.Ext.ancestry (an array of keywords in a specific order). These fields come from the elastic endpoint data source. Our goal is to use runtime fields so that users can apply our tool to custom data sources.

TLDR

Background

Our team is building a tool that allows analysts to visualize relationships between events from a data source. Our first use case was visualizing a process tree. Below is an example of the visualization.
Analyze Event Tool ![image](https://user-images.githubusercontent.com/56361221/102378716-1847cb00-3f94-11eb-8415-28b0a99eb5ee.png)
## Nodes and edges

Currently, nodes in the graph represent unique processes. A downward edge in the graph represents a process spawning a child process. When laying out the graph we rely on a few fields in the documents. `process.entity_id` is used as a globally unique identifier. `process.parent.entity_id` is used as a key to link two nodes together. Consider the following graph:
Example Simple Graph ![resolver_tree_children_simple](https://user-images.githubusercontent.com/56361221/102380007-6f9a6b00-3f95-11eb-8467-241a11a29cf1.png)
In this example, the `process.entity_id` for the `A` node is `A`, and `A` does not have a `process.parent.entity_id`. Node `A` spawned nodes `B` and `G`. Their fields are detailed below:

- `B`
  - `process.entity_id: B`
  - `process.parent.entity_id: A`
- `G`
  - `process.entity_id: G`
  - `process.parent.entity_id: A`

The rest of the nodes in the graph follow the same pattern.

## Searching for ancestors of a node

To lay out the example graph above we would first look for ancestors of node `A`. The pseudo code for that process is something like:

- Find all the ancestors of the node of interest (`A`)
  - Search for documents whose `process.entity_id` matches the `process.parent.entity_id` field of the node of interest
  - If a document is found then repeat the loop, continuing up the ancestors

This search is expensive because we have to make a request to Elasticsearch for each ancestor of the starting node.

## Searching for descendants of a node

Searching for descendants of a node is similar. If we want to find descendants of node `A` we would search for all documents where the `process.parent.entity_id` field equals `A`. Again this search is expensive because we have to search level by level.

## Ancestry field optimization

To avoid having to make a request for each level when searching, we leverage a field that keeps track of the ancestry. This field is `process.Ext.ancestry`. The ancestry field is an array of `entity_id` values for the node's ancestry in a specific order. Entries closer to index 0 in the array are closer ancestors of the node. For example:

```
ancestry[0] == parent
ancestry[1] == grandparent
etc
```

This field is populated by the tool (the elastic endpoint) that inserts the documents into Elasticsearch. This field is particularly helpful for finding descendants of a node. Instead of having to make a request per level, we can search for all documents that have the node of interest's `process.entity_id` in the `process.Ext.ancestry` field.
If we consider this example again:
Example Simple Graph ![resolver_tree_children_simple](https://user-images.githubusercontent.com/56361221/102380007-6f9a6b00-3f95-11eb-8467-241a11a29cf1.png)
The ancestry array is indicated by the array (`[...]`) next to each node in the graph. Node `C` has an ancestry array of `[B, A]` because `B` is the direct parent and `A` is the grandparent.
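To make the difference between the two search strategies concrete, here is a minimal in-memory sketch. It is not the tool's implementation: the `ProcessDoc` shape, the helper names, and the idea of counting one "request" per level are assumptions made for illustration, and only nodes `A`, `B`, `G` and `C` from the example are modeled.

```typescript
// Hypothetical in-memory stand-in for process documents. The properties map to
// process.entity_id, process.parent.entity_id and process.Ext.ancestry.
interface ProcessDoc {
  entityId: string;
  parentId?: string;
  ancestry: string[]; // ancestry[0] is the parent, ancestry[1] the grandparent, ...
}

const docs: ProcessDoc[] = [
  { entityId: "A", ancestry: [] },
  { entityId: "B", parentId: "A", ancestry: ["A"] },
  { entityId: "G", parentId: "A", ancestry: ["A"] },
  { entityId: "C", parentId: "B", ancestry: ["B", "A"] },
];

// Level-by-level search: each loop iteration stands for one search request
// that matches documents whose parent is in the current frontier.
function descendantsByLevel(
  all: ProcessDoc[],
  root: string
): { ids: string[]; requests: number } {
  let found: string[] = [];
  let frontier: string[] = [root];
  let requests = 0;
  while (frontier.length > 0) {
    requests++;
    const current = frontier;
    frontier = all
      .filter((d) => d.parentId !== undefined && current.indexOf(d.parentId) !== -1)
      .map((d) => d.entityId);
    found = found.concat(frontier);
  }
  return { ids: found, requests };
}

// Ancestry search: a single pass that matches every document whose ancestry
// array contains the node of interest.
function descendantsByAncestry(all: ProcessDoc[], root: string): string[] {
  return all.filter((d) => d.ancestry.indexOf(root) !== -1).map((d) => d.entityId);
}

const byLevel = descendantsByLevel(docs, "A");
const byAncestry = descendantsByAncestry(docs, "A");
console.log(byLevel.ids, byLevel.requests, byAncestry);
```

In this sketch both approaches find `B`, `G` and `C` for node `A`, but the level-by-level version needs one pass per level plus a final empty one, while the ancestry version needs a single pass.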

The issue with using a terms aggregation

The elastic endpoint creates specific events to describe the different stages of a process (started, stopped, already running, exec'ed). Because of this we'd like to collapse on the process.entity_id to avoid retrieving multiple documents per process.entity_id. A terms aggregation can be used for this but we'd also like to sort the results in breadth-first order. This can be accomplished by using a query like this:

Terms agg BFS

```
POST logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "process.Ext.ancestry": "9tw2j9fryf" } },
        { "term": { "event.category": "process" } },
        { "term": { "event.kind": "event" } }
      ]
    }
  },
  "aggs": {
    "by_entity_id": {
      "terms": {
        "field": "process.entity_id",
        "size": 100,
        "order": { "bfs_sort": "asc" }
      },
      "aggs": {
        "top_children": {
          "top_hits": {
            "_source": ["process.Ext.ancestry", "process.entity_id", "process.parent.entity_id"],
            "size": 1,
            "sort": [
              { "@timestamp": { "order": "asc" } }
            ]
          }
        },
        "bfs_sort": {
          "max": {
            "script": {
              "source": """
                Map ancestry = [:];
                int length = params._source.process.Ext.ancestry.length;
                List sourceAncestryArray = params._source.process.Ext.ancestry;
                for (int i = 0; i < length; i++) {
                  ancestry[sourceAncestryArray[i]] = i;
                }
                for (String id : params.ids) {
                  def index = ancestry[id];
                  if (index != null) {
                    return index;
                  }
                }
                return -1;
              """,
              "params": {
                "ids": ["yo", "9tw2j9fryf"]
              }
            }
          }
        }
      }
    }
  }
}
```

The script used in the bfs_sort aggregation calculates how far removed each descendant is from the requested node, which effectively groups the documents by level.
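As an illustration of what that script computes, the same logic can be sketched in TypeScript. This is a hand-written port for explanation only: `bfsSortKey`, `sortBreadthFirst` and the `Descendant` shape are hypothetical names, and the tie-break on `entityId` is an assumption added to keep the example deterministic.

```typescript
// Hypothetical port of the Painless sort script: returns the position of the
// first requested id found in the ancestry array, or -1 when none is present.
// lastIndexOf mirrors the script's map, where a repeated id keeps its last index.
function bfsSortKey(ancestry: string[], ids: string[]): number {
  for (const id of ids) {
    const index = ancestry.lastIndexOf(id);
    if (index !== -1) {
      return index;
    }
  }
  return -1;
}

interface Descendant {
  entityId: string;
  ancestry: string[];
}

// Sort ascending by the key so that direct children (key 0) come before
// grandchildren (key 1), and so on. Ties are broken by entityId here.
function sortBreadthFirst(nodes: Descendant[], ids: string[]): Descendant[] {
  return nodes.slice().sort((a, b) => {
    const diff = bfsSortKey(a.ancestry, ids) - bfsSortKey(b.ancestry, ids);
    if (diff !== 0) {
      return diff;
    }
    return a.entityId < b.entityId ? -1 : a.entityId > b.entityId ? 1 : 0;
  });
}

const unsorted: Descendant[] = [
  { entityId: "C", ancestry: ["B", "A"] },
  { entityId: "G", ancestry: ["A"] },
  { entityId: "B", ancestry: ["A"] },
];

const ordered = sortBreadthFirst(unsorted, ["A"]).map((n) => n.entityId);
console.log(ordered); // B and G (level 0) come before C (level 1)
```

Because the key is the index of the requested node in each descendant's ancestry array, it equals the number of generations between that descendant and its direct child level, which is what produces the level grouping.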

In our testing, we found that if the `size` field for the terms aggregation was less than the total number of documents, the terms aggregation would fail to return certain nodes, or even entire levels, in the response. I believe this issue is described in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-size

Collapsing on runtime fields

Currently to get around the terms aggregation size issue we collapse in our Elasticsearch requests on the process.entity_id field (or a runtime field specified by the user). We then use a script to sort the results in BFS order. Something like this:

Collapse query example

```typescript
{
  _source: false,
  docvalue_fields: this.docValueFields,
  size,
  collapse: {
    // this.schema.id is process.entity_id or a field that the user chooses
    field: this.schema.id,
  },
  sort: [
    {
      _script: {
        type: 'number',
        script: {
          /**
           * This script is used to sort the returned documents in a breadth first order so that we return all of
           * a single level of nodes before returning the next level of nodes. This is needed because using the
           * ancestry array could result in the search going deep before going wide depending on when the nodes
           * spawned their children. If a node spawns a child before its sibling is spawned then the child would
           * be found before the sibling because by default the sort was on timestamp ascending.
           */
          source: `
            Map ancestryToIndex = [:];
            List sourceAncestryArray = params._source.${ancestryField};
            int length = sourceAncestryArray.length;
            for (int i = 0; i < length; i++) {
              ancestryToIndex[sourceAncestryArray[i]] = i;
            }
            for (String id : params.ids) {
              def index = ancestryToIndex[id];
              if (index != null) {
                return index;
              }
            }
            return -1;
          `,
          params: {
            // nodes are the requested nodes of interest to find descendants for
            ids: nodes,
          },
        },
      },
    },
    { '@timestamp': 'asc' },
  ],
  ...
}
```

https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/endpoint/routes/resolver/tree/queries/descendants.ts#L87

Not sure if it makes a difference, but we are not using the inner_hits functionality of collapse and have no plans to.
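For readers less familiar with collapse, the effect we rely on in the query above can be approximated in memory: sort the hits, then keep only the top sorted hit per collapse key. The sketch below is an approximation for illustration, not Elasticsearch's implementation; the `Hit` shape, the precomputed `level` property (standing in for the script sort value), and the helper names are all hypothetical.

```typescript
// Hypothetical hit shape: `level` stands in for the value returned by the BFS
// sort script, `timestamp` for @timestamp, `eventType` for the process stage.
interface Hit {
  entityId: string;
  level: number;
  timestamp: number;
  eventType: string;
}

// Sort by level first, then by timestamp ascending, like the `sort` array above.
function sortHits(hits: Hit[]): Hit[] {
  return hits.slice().sort((a, b) => a.level - b.level || a.timestamp - b.timestamp);
}

// Keep only the first (top sorted) hit for each collapse key.
function collapseBy<T>(sorted: T[], key: (hit: T) => string): T[] {
  const seen: { [k: string]: boolean } = {};
  const collapsed: T[] = [];
  for (const hit of sorted) {
    const k = key(hit);
    if (seen[k] !== true) {
      seen[k] = true;
      collapsed.push(hit);
    }
  }
  return collapsed;
}

// Several events per process, as the endpoint emits one per process stage.
const hits: Hit[] = [
  { entityId: "C", level: 1, timestamp: 2, eventType: "start" },
  { entityId: "B", level: 0, timestamp: 5, eventType: "end" },
  { entityId: "G", level: 0, timestamp: 3, eventType: "start" },
  { entityId: "B", level: 0, timestamp: 1, eventType: "start" },
  { entityId: "C", level: 1, timestamp: 4, eventType: "end" },
];

// The key function could just as well compute a value per hit, which is the
// role a runtime field would play if collapse supported it.
const collapsed = collapseBy(sortHits(hits), (h) => h.entityId);
console.log(collapsed.map((h) => h.entityId + ":" + h.eventType));
```

In this sketch each process appears once, represented by its earliest event, and the processes come back level by level.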

jonathan-buttner commented 3 years ago

@nik9000 @javanna @jimczi I wanted to explain a little more about our use case for collapse on runtime fields. I'm happy to do a follow-up Zoom call or explain more if there are areas that don't make sense.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

nik9000 commented 3 years ago

I think it's mostly an oversight that runtime fields don't support field collapsing. It might be slow, but when you use runtime fields you accept slow things.

bsamseth commented 2 years ago

I know this issue has been here for a while, but it doesn't seem to have any progress. My company's use case would be solved neatly by a runtime field in conjunction with collapse. Are there any plans to implement this?

At the very least the documentation should be updated. A quote from the docs:

> You access runtime fields from the search API like any other field, and Elasticsearch sees runtime fields no differently.

This clearly isn't the case, and in my case it led to quite a bit of wasted time trying to implement this before realizing that collapse is not supported. This is made worse by the error you get if you try it. Using an example from the docs and adding a collapse:

```
GET my_index/_search
{
  "runtime_mappings": {
    "day_of_week": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
      }
    }
  },
  "collapse": {
    "field": "day_of_week"
  },
  "query": {
    "match_all": {}
  }
}
```

produces an error saying collapse is not supported for the field [day_of_week] of the type [keyword]. This makes it seem like collapse doesn't support keyword types, which is not the case. Getting from this error to the realization that it means "collapse is not supported for the runtime field ..." is not obvious, given that the docs say the opposite.

zdeseb commented 2 years ago

Just want to let you know that in our use-case we too miss collapse on runtime fields.

It is strange, and not what one would expect, that runtime fields work well with the cardinality aggregation but not with the collapse feature.

SouzaGabrielC commented 1 year ago

It seems this is still an issue, and I can relate to what @bsamseth commented. I had the same problem: I tried to use it and hit a dead end with that error, only to realize it is a runtime field problem.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)