Open rw-access opened 3 years ago
Pinging @elastic/es-perf (Team:Performance)
How do we estimate the number of documents matching regsvr32.exe? Do we need stats?
Actually we have these stats already in the inverted index, called "document frequency", and this is generalized across all queries, which by contract must be able to provide an estimate of their number of matches, called "cost". We use these costs today in order to figure out the best order to evaluate clauses in conjunctive boolean queries, from the one that has the lowest number of matches to the one that has the greatest number of matches.
Let me share some additional context.
On wildcard fields, wildcard/regexp queries are generally parsed as the conjunction between a fast approximation based on substrings that can be extracted from the query, and a slow veryfication using doc values. I expect these queries to run rather efficiently, including when intersected with highly-selective clauses.
Keyword fields are harder because they don't directly give access from documents to values, there is an indirection where documents are associated with ordinals, which uniquely identify terms in the terms dictionary of the field. This gives us two strategies for running doc values against keyword fields:
The most common performance penalty with regexp/wildcard queries comes from the up-front cost of evaluating all terms of the terms dictionary against the query (even though this is done in an optimized way), so we'd need to use the latter approach.
I think that your idea to use the field cardinality as a metric for the cost of the wildcard/regexp query is a good one, and like you guessed, the hard work will be to figure out a good factor to decide whether we should use the index or doc values to run the query
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Wildcard queries are pretty common but often come at a performance cost. Although the new
wildcard
data type is optimized for search speed of high cardinality fields, recent findings have shown that it can come at a cost of ingest speed and index size.I think there's another option to optimize search speed of wildcard queries on
keyword
fields, without increasing the ingest cost. Knowing when to search the index vs using docvalues on a limited set of results might lead to better search time.Wildcard queries are unavoidable in some domains or solutions, like security. They can be a sore spot, so improvements here could have a major positive impact.
For example, here's one KQL query that could be done faster with docvalues:
Rationale/heuristic:
regsvr32.exe
*
regsvr32.exe
process documents is much less than the total number of unique command linesComplexities:
regsvr32.exe
? Do we need stats?Possible approaches:
1/√N
a good estimate to start? A geometric mean seems saferevent.category : process and process.name : regsvr32.exe
) and then look at the cost of an index scan (process.command_line : *https\:*//*
) to determine then which approach is better.I will defer to the experts @markharwood @jpountz to decide and figure out what's best. My understanding of low-level details here is looking limited, so please don't take my suggestions too literally. Whatever approach is best, I do think that improving wildcard search speed would add significant value to our users, without significant cost.
Links This is a similar approach to range query planning, as was done in https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch