elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.49k stars 24.88k forks source link

EQL: Sequence performance improvements #60833

Open rw-access opened 4 years ago

rw-access commented 4 years ago

I've noticed that sequences tend to timeout if one of the subqueries in the sequence yields too many results. This seems to suggest that the most expensive subquery is what drives the query. In some cases, we could update our sequencing algorithm to have the least expensive subquery drive the performance.

few -> few

:These sequences work great and need no improvements.

few -> many

Here's an example query that will use a sequence to join certutil.exe with its child process. Ideally, process.parent.name == "certutil.exe" would be added to the child to solve this problem, but we can still use it as an example.

sequence by agent.id with maxspan=1m
  // matches 6 results
  [process where process.name == "certutil.exe"] by process.entity_id

  // matches millions of results
  [process where event.type in ("start", "process_start")] by process.parent.entity_id

There are at most six sequences that could be returned from this. There are a few ways we could skip over large swaths of results for the second subquery to have faster performance time. I think we need to use search_after: <certutil1.@timestamp> to skip straight ahead. At most, one minute of data is required for that one sequence. I think using a search_after optimization while maintaining correctness is nontrivial, but probably worth the change. At most, you should need to consume 1 minute of data * 6 possible sequences, assuming you overlap.

many -> few

A similar example query could use a sequence to join certutil.exe to its parent process. In most scenarios, this exact query wouldn't be necessary because you can usually find the fields you need in process.parent, but it's a good example of the point. Same query as before, but flipped:

sequence by agent.id with maxspan=1m
  // matches millions of results
  [process where event.type in ("start", "process_start")] by process.parent.entity_id

  // matches 6 results
  [process where process.name == "certutil.exe"] by process.entity_id

Now, the second term is the driving factor. But there's no search_before unfortunately. But we could switch the sort order for search after. We can also use a range query to constrain this from both directions.

{
  "query": { /* original query */ },
  "search_after": ["current timestamp", "current tiebreaker"],
  "sort": [{"@timestamp": {"order": "desc"}, "<tiebreaker>": {"order": "desc"}]
  "size": 1
}

Now at a minimum, you have timestamp + tiebreaker, and you can switch back to your original order. Again this optimization is tricky to maintain correctness, especially for sequences more than 2 events long.

many -> many

These optimizations are more complex, and I think the best ways to scale will involve more query planning using aggregations or histograms to see where the results are more distributed. I think these are more complex, but for now, I think it's reasonable to have users adjust expectations until these are more optimal.

elasticmachine commented 4 years ago

Pinging @elastic/es-ql (:Query Languages/EQL)

blookot commented 2 years ago

Hi team, is https://discuss.elastic.co/t/eql-sequence-performance-issue/300601 related to this issue?

elasticsearchmachine commented 10 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)