Query planning for wildcard queries with keyword fields

rw-access commented 3 years ago

Wildcard queries are pretty common but often come at a performance cost. Although the new wildcard data type is optimized for search speed of high cardinality fields, recent findings have shown that it can come at a cost of ingest speed and index size.

I think there's another option to optimize search speed of wildcard queries on keyword fields, without increasing the ingest cost. Knowing when to search the index vs using docvalues on a limited set of results might lead to better search time.

Wildcard queries are unavoidable in some domains or solutions, like security. They can be a sore spot, so improvements here could have a major positive impact.

For example, here's one KQL query that could be done faster with docvalues:

event.category : process and process.name : regsvr32.exe and process.command_line : *https\:*//*

Rationale/heuristic:

There is a limited number of process events for regsvr32.exe
There are a lot of unique command lines, making it a high cardinality field. That makes an index scan expensive because of the leading *
The number of regsvr32.exe process documents is much less than the total number of unique command lines

Complexities:

How do we estimate the number of documents matching regsvr32.exe? Do we need stats?
What's the additional cost factor for searching docvalues instead of the index? 2x?

Possible approaches:

In the absence of stats, we have to make assumptions about the distribution of process.name values. A linear distribution might be too naive and would not account for the additional overhead of searching docvalues. Is 1/√N a good estimate to start with? A geometric mean seems safer.
See what SQL typically does, and look at its explanations.
Make the decision late: search in two passes, count how many results match the first phase (event.category : process and process.name : regsvr32.exe) and then look at the cost of an index scan (process.command_line : *https\:*//*) to determine which approach is better.
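To make the late-decision idea concrete, here is a rough Python sketch of the comparison. Everything in it is hypothetical: the names `PlanInputs` and `choose_strategy`, the use of field cardinality as the index scan cost, and the 2x docvalues penalty (taken from the open question above, not from any measurement).

```python
from dataclasses import dataclass


@dataclass
class PlanInputs:
    lead_matches: int               # docs matched by the selective first-phase clauses
    field_cardinality: int          # unique terms in the wildcard field (index scan cost proxy)
    doc_values_factor: float = 2.0  # assumed per-document penalty for docvalues; a guess


def choose_strategy(p: PlanInputs) -> str:
    """Pick the cheaper plan under a deliberately simple cost model."""
    # Assumption: an index scan has to test every unique term against the pattern.
    index_scan_cost = p.field_cardinality
    # Assumption: docvalues cost grows linearly with the first-phase match count.
    doc_values_cost = p.lead_matches * p.doc_values_factor
    return "doc_values" if doc_values_cost < index_scan_cost else "index_scan"


# Few regsvr32.exe events, many unique command lines: docvalues should win.
print(choose_strategy(PlanInputs(lead_matches=500, field_cardinality=2_000_000)))
```

The real factor would need benchmarking; the sketch only shows where such a factor would plug in.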

I will defer to the experts @markharwood @jpountz to decide and figure out what's best. My understanding of the low-level details here is limited, so please don't take my suggestions too literally. Whatever approach is best, I do think that improving wildcard search speed would add significant value for our users, without significant cost.

Links:

This is a similar approach to range query planning, as was done in https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch

elasticmachine commented 3 years ago

Pinging @elastic/es-perf (Team:Performance)

jpountz commented 3 years ago

How do we estimate the number of documents matching regsvr32.exe? Do we need stats?

Actually we have these stats already in the inverted index, called "document frequency", and this is generalized across all queries, which by contract must be able to provide an estimate of their number of matches, called "cost". We use these costs today in order to figure out the best order to evaluate clauses in conjunctive boolean queries, from the one that has the lowest number of matches to the one that has the greatest number of matches.
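As a toy illustration of cost-ordered conjunctions (not the actual Lucene code), the sketch below uses plain Python sets as postings and the set size as the "cost". The function names and sample data are made up for the example.

```python
def order_by_cost(postings):
    """Return clause names sorted from fewest to most matching documents."""
    return sorted(postings, key=lambda name: len(postings[name]))


def intersect(postings):
    """Evaluate a conjunction by iterating the cheapest clause and checking the rest."""
    if not postings:
        return set()
    ordered = order_by_cost(postings)
    lead, others = ordered[0], ordered[1:]
    # Only documents from the lowest-cost clause are ever visited.
    return {doc for doc in postings[lead]
            if all(doc in postings[name] for name in others)}


postings = {
    "event.category:process": {1, 2, 3, 4, 5, 6},
    "process.name:regsvr32.exe": {2, 4},
    "host.os.type:windows": {1, 2, 3, 4},
}
print(order_by_cost(postings)[0])  # the most selective clause leads
print(sorted(intersect(postings)))
```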

Let me share some additional context.

On wildcard fields, wildcard/regexp queries are generally parsed as the conjunction of a fast approximation based on substrings that can be extracted from the query, and a slow verification using doc values. I expect these queries to run rather efficiently, including when intersected with highly-selective clauses.
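A minimal sketch of that approximate-then-verify shape, assuming a trigram index for the approximation and Python's `fnmatchcase` as a stand-in for the wildcard matcher. The helper names and sample values are illustrative only and do not mirror the real implementation.

```python
import re
from fnmatch import fnmatchcase


def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(values):
    """Map each trigram to the set of doc IDs whose value contains it."""
    index = {}
    for doc, value in enumerate(values):
        for gram in trigrams(value):
            index.setdefault(gram, set()).add(doc)
    return index


def search(values, index, pattern):
    # Approximation: every trigram of every literal fragment must be present.
    literals = [part for part in re.split(r"[*?]", pattern) if len(part) >= 3]
    required = set().union(*(trigrams(lit) for lit in literals))
    candidates = set(range(len(values)))
    for gram in required:
        candidates &= index.get(gram, set())
    # Verification: run the full pattern against the stored value of each candidate.
    return sorted(doc for doc in candidates if fnmatchcase(values[doc], pattern))


values = ["regsvr32 /i:https://evil/x", "regsvr32 /s a.dll", "curl https://ok"]
index = build_index(values)
print(search(values, index, "*https://*"))
```

Note that `fnmatchcase` also treats `[` as special, so this stand-in only behaves like a `*`/`?` wildcard for patterns without brackets.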

Keyword fields are harder because they don't directly give access from documents to values. There is an indirection: documents are associated with ordinals, which uniquely identify terms in the terms dictionary of the field. This gives us two strategies for running doc values against keyword fields:

Either extract the matching ordinals from the terms dictionary into a bitset and then return a scorer that checks each document's ordinal against this bitset. This has a high up-front cost, but then the per-document cost is low.
Or dynamically look up terms from the ordinals at runtime, and run these terms against the wildcard/regexp to figure out whether the document matches. There is no up-front cost anymore, but the per-document cost is now high. This would be even slower when random access to the terms dictionary is slow, e.g. on spinning disks or the frozen tier.
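The two strategies can be sketched in Python with a list standing in for the terms dictionary (ordinal = list index) and a list mapping doc ID to ordinal. This is a simplified model with made-up names and data; `fnmatchcase` is a stand-in for the real wildcard automaton.

```python
from fnmatch import fnmatchcase


def upfront_matcher(terms, pattern):
    """Strategy 1: scan the whole terms dictionary once, then do cheap set lookups."""
    matching_ordinals = {o for o, term in enumerate(terms) if fnmatchcase(term, pattern)}
    return lambda ordinal: ordinal in matching_ordinals


def lazy_matcher(terms, pattern):
    """Strategy 2: no up-front work; resolve and test the term for each document."""
    return lambda ordinal: fnmatchcase(terms[ordinal], pattern)


def filter_docs(candidate_docs, doc_ordinals, matcher):
    return [doc for doc in candidate_docs if matcher(doc_ordinals[doc])]


terms = ["cmd.exe /c dir", "regsvr32 /s /i:https://x/y scrobj.dll", "regsvr32 /s foo.dll"]
doc_ordinals = [0, 1, 2, 1]   # doc ID -> ordinal of its command line
candidates = [1, 2, 3]        # docs already matched by process.name : regsvr32.exe

for make in (upfront_matcher, lazy_matcher):
    print(make.__name__, filter_docs(candidates, doc_ordinals, make(terms, "*https://*")))
```

Both matchers return the same documents; they differ only in where the pattern matching cost is paid, which is why the lazy variant is attractive when the candidate set is small relative to the terms dictionary.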

The most common performance penalty with regexp/wildcard queries comes from the up-front cost of evaluating all terms of the terms dictionary against the query (even though this is done in an optimized way), so we'd need to use the latter approach.

I think that your idea to use the field cardinality as a metric for the cost of the wildcard/regexp query is a good one, and like you guessed, the hard work will be to figure out a good factor to decide whether we should use the index or doc values to run the query.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elastic / elasticsearch

Query planning for wildcard queries with keyword fields #70612