Open markharwood opened 3 years ago
Pinging @elastic/es-search (Team:Search)
This is a nice idea! Lucene has a DocValuesTermsQuery specifically for this usecase, and MUST_NOT clauses are only evaluated on docs that have already matched on other clauses so we may find that performance is not an issue even on quite small sets of terms.
I benchmarked a quick script-based solution vs terms
query and it looks like it does provide a pay-off for very large exclusion lists.
A native impl will clearly be better than the script but it does demonstrate there's a fine cut-off point between picking strategies and I wonder if this points to the need to allow users to provide the equivalent of the execution_hint
we use elsewhere for guidance.
These results were for searches matching about 80k docs but the performance will vary depending on the number of docs matching the main query (For clarity I will work on adding benchmarks varying result set size). If we characterise the costs of the 2 matching strategies:
1) Terms query cost is linear with the number of search terms
2) DV query cost is linear with the number of docs matching the other parts of the query
It will be therefore be hard to pick the right strategy in advance.
Further to last comment - the effectiveness of dv query vs terms query also depends on the number of matching docs. The ratio between search terms and matching docs has to be skewed towards search terms.
With 10k exclusion terms, the DV approach starts out faster than terms query but becomes the slower strategy as the other query term is relaxed and we match more docs:
However, with 50k exclusion terms, the dv approach remains the faster strategy for the increasing volumes of matching docs under this test.
Without knowing the number of docs matched by the main query (which is hard to predict in advance) we don't know if selecting a doc-values approach will be faster than a terms query approach while executing the query.
https://github.com/elastic/elasticsearch/issues/70612 is another open "use doc-values or index query?" issue so should probably share the same cost analysis function to help inform the choice of execution taken.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Some applications require large filter lists e.g. the lists of "swipe lefts" each user of a dating site has accumulated and doesn't want to see again in new searches.
These are typically presented as a large list of
terms
in amust_not
clause used to filter another query. Sometimes the main query producing the matches returns less documents than the large number of IDs in the filter clause. This scenario is needlessly expensive because all the terms in the filter are first looked up and turned into a bitset at great cost (linear with the number of filter terms). If we recognise this scenario (must_not + large list of single-value terms on a common field) then we can potentially execute the request more efficiently by making the filter access doc values of documents matching the main query and checking to see if the value is in an exclusion hashset. The trade-off might be a bad one if the query is not very selective (e.g. "find all females anywhere") but given a high number of filter clause terms it might be a safe bet to always assume the doc-values approach.