Faster support for large exclusion lists in searches

markharwood commented 3 years ago

Some applications require large filter lists e.g. the lists of "swipe lefts" each user of a dating site has accumulated and doesn't want to see again in new searches.

These are typically presented as a large list of terms in a must_not clause used to filter another query. Sometimes the main query producing the matches returns less documents than the large number of IDs in the filter clause. This scenario is needlessly expensive because all the terms in the filter are first looked up and turned into a bitset at great cost (linear with the number of filter terms). If we recognise this scenario (must_not + large list of single-value terms on a common field) then we can potentially execute the request more efficiently by making the filter access doc values of documents matching the main query and checking to see if the value is in an exclusion hashset. The trade-off might be a bad one if the query is not very selective (e.g. "find all females anywhere") but given a high number of filter clause terms it might be a safe bet to always assume the doc-values approach.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

romseygeek commented 3 years ago

This is a nice idea! Lucene has a DocValuesTermsQuery specifically for this usecase, and MUST_NOT clauses are only evaluated on docs that have already matched on other clauses so we may find that performance is not an issue even on quite small sets of terms.

markharwood commented 3 years ago

I benchmarked a quick script-based solution vs terms query and it looks like it does provide a pay-off for very large exclusion lists.

A native impl will clearly be better than the script but it does demonstrate there's a fine cut-off point between picking strategies and I wonder if this points to the need to allow users to provide the equivalent of the execution_hint we use elsewhere for guidance. These results were for searches matching about 80k docs but the performance will vary depending on the number of docs matching the main query (For clarity I will work on adding benchmarks varying result set size). If we characterise the costs of the 2 matching strategies: 1) Terms query cost is linear with the number of search terms 2) DV query cost is linear with the number of docs matching the other parts of the query

It will be therefore be hard to pick the right strategy in advance.

markharwood commented 3 years ago

Further to last comment - the effectiveness of dv query vs terms query also depends on the number of matching docs. The ratio between search terms and matching docs has to be skewed towards search terms.

With 10k exclusion terms, the DV approach starts out faster than terms query but becomes the slower strategy as the other query term is relaxed and we match more docs:

However, with 50k exclusion terms, the dv approach remains the faster strategy for the increasing volumes of matching docs under this test.

Conclusion

Without knowing the number of docs matched by the main query (which is hard to predict in advance) we don't know if selecting a doc-values approach will be faster than a terms query approach while executing the query.

markharwood commented 3 years ago

https://github.com/elastic/elasticsearch/issues/70612 is another open "use doc-values or index query?" issue so should probably share the same cost analysis function to help inform the choice of execution taken.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elastic / elasticsearch

Faster support for large exclusion lists in searches #74954

Conclusion