elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Automated query rewriting - deciding on Phrase vs AND vs OR for a search. #29091

Closed markharwood closed 3 months ago

markharwood commented 6 years ago

Given a user's search input, e.g. bump stocks, a search system needs to choose whether this should be treated as a phrase, ANDed or ORed. Often the advice given is "use all three - Lucene scoring will sort out the ranking". This may work well for ranking the hits, but consider that all the aggregations, e.g. timelines or category counts, will be messed up if you over-match. For this reason it is common practice for faceted ecommerce-type search interfaces to start with a strict interpretation of the query (e.g. an exact phrase match) and only back off to a more relaxed AND or OR if the exact match produces just a handful of results for the user's search. The minimum number of results that marks this relaxation threshold is a somewhat arbitrary decision and I think we can do better.
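
For context, a minimal sketch of that conventional threshold-based back-off, assuming a hypothetical "news" index with a "body" field and the Python elasticsearch client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
MIN_HITS = 10  # the somewhat arbitrary relaxation threshold

def tiered_search(text, index="news", field="body"):
    strategies = [
        {"match_phrase": {field: text}},                         # strictest: exact phrase
        {"match": {field: {"query": text, "operator": "and"}}},  # all terms, any order
        {"match": {field: {"query": text, "operator": "or"}}},   # most relaxed: any term
    ]
    for query in strategies:
        resp = es.search(index=index, body={"query": query})
        # note: on pre-7.x clusters hits.total is a plain integer
        if resp["hits"]["total"]["value"] >= MIN_HITS:
            return resp
    return resp  # fall through to the most relaxed interpretation
```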

In this demo video I illustrate how a combination of Span and FunctionScore queries, along with significant_text and adjacency_matrix aggregations, can be used to compare the results of sloppy and exact-match interpretations of a query. The resulting "concept graphs" (buzzwordy, I know) can show the difference between cohesive and ambiguous result sets. Aggregations should really only be done on cohesive result sets, and I suspect we can compute a "cohesion measure" for each of the graphs which takes into account:

1) How tightly bound the "significant" terms are - single cloud or many "islands"?
2) The IDF of the terms - a weak result set produces tightly-bound but common words like "with" and "this"
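
A minimal sketch of the two requests being combined here - significant terms for the graph's nodes, an adjacency matrix for its edges (the index name "news", field "body" and the Python client usage are illustrative assumptions, not the exact code behind the video):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def concept_graph(query, index="news", field="body", sample_size=200):
    # Pass 1: significant terms from a sample of the best-matching docs -
    # these are the nodes of the concept graph.
    resp = es.search(index=index, body={
        "query": query,
        "size": 0,
        "aggs": {"sample": {
            "sampler": {"shard_size": sample_size},
            "aggs": {"keywords": {"significant_text": {"field": field}}},
        }},
    })
    terms = [b["key"] for b in
             resp["aggregations"]["sample"]["keywords"]["buckets"]]
    # Pass 2: adjacency_matrix counts how often the significant terms
    # co-occur - the co-occurrence counts are the edges of the graph.
    filters = {t: {"match_phrase": {field: t}} for t in terms}
    return terms, es.search(index=index, body={
        "query": query,
        "size": 0,
        "aggs": {"edges": {"adjacency_matrix": {"filters": filters}}},
    })
```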

Once we have a measure like this we can compare the effects of various forms of query tightening or relaxation on the results as a whole. In each trial we:

1) Formulate a query slightly more relaxed than the last one by: a) increasing the span slop factor, b) switching from a Span query to an AND, or c) dropping a MUST clause for a common word (a sketch of this relaxation ladder is below).
2) Review the result set by recomputing the cohesion measure and comparing it with that of the previous, stricter interpretation.
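
One possible shape for the relaxation ladder in step 1 - the slop steps, the field name and the assumption that the terms are already analyzed (e.g. lowercased) tokens are all illustrative:

```python
def relaxations(terms, field="body"):
    # terms are assumed to be already-analyzed (e.g. lowercased) tokens
    span_clauses = [{"span_term": {field: t}} for t in terms]
    # a) exact phrase first, then increasing span slop
    for slop in (0, 1, 2, 4):
        yield {"span_near": {"clauses": span_clauses,
                             "slop": slop,
                             "in_order": slop == 0}}
    # b) switch from a span query to a plain AND of all terms
    yield {"match": {field: {"query": " ".join(terms), "operator": "and"}}}
    # c) drop one MUST clause at a time (e.g. to lose a common word)
    for i in range(len(terms)):
        kept = terms[:i] + terms[i + 1:]
        if kept:
            yield {"match": {field: {"query": " ".join(kept),
                                     "operator": "and"}}}
```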

Example graphs are below:

Here is the strictest exact-phrase-match set of results:

[kibana screenshot: concept graph of significant terms for the exact-phrase match]

and by contrast here are the results for a sloppy OR relaxation of the search terms:

[kibana screenshot: concept graph for the sloppy OR interpretation]

Visually these look very different: the exact match contains comparatively rare, high-precision terms, while the sloppy match contains none of these, instead consisting of commonplace words. It should be possible to formulate an automated query-tuning system that can contrast these differences and pick the most appropriate strategy. Note that picking the correct exact-match strategy will produce a timeline aggregation that looks like this:

[kibana screenshot: timeline aggregation for the correct exact-match strategy]

while picking the wrong, sloppy strategy will produce misleading aggregations that look like this:

[kibana screenshot: misleading timeline aggregation for the sloppy strategy]

Obviously it would be expensive to perform this level of analysis on every user query in real time, but a batch system could pre-compute the right choice of querying template for the most popular queries on a website.

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

markharwood commented 6 years ago

@qhoxie - Colin GS said you might have an interest in this. Would welcome any input.

markharwood commented 6 years ago

I've experimented with some query rewriting strategies for searches on news data. As I suggested previously, the goal here is to improve recall while carefully measuring any loss in precision.

Cohesion scoring approaches - establishing the gold standard.

I chose to use the list of significant terms from an exact-phrase match as the gold standard against which sloppier query variations are compared. Together, the significant terms give an indication of the meaning of the gathered results. The assumption is always that this is a strong starting query. A weak or ambiguous query can be detected by examining the adjacency matrix of top terms, e.g. this search for "black ice":

[kibana screenshot: fragmented concept graph for the ambiguous query "black ice"]

However, I have chosen not to focus on the problem of correcting weak starting queries with "did you mean ACDC or icy road?"-type suggestions. These sorts of ambiguous phrase queries are relatively rare. My assumption for now is that the starting phrase query is unambiguous and can be used as the gold standard against which sloppier variations can be measured. An example of a strong query is the phrase bump stocks, and we will use this to illustrate our approach. The graph of top significant terms is pretty cohesive:

[kibana screenshot: cohesive concept graph for the phrase "bump stocks"]

How to score query relaxations

Given a loosening of the query (increased slop factor on spans, ORs instead of ANDs...), the results are compared against the gold standard. I found it wasn't strictly necessary to diff the graph shapes of significant-term connections; such a comparison would require an extra call using an adjacency_matrix aggregation. A simpler comparison is to take the significant terms from the sloppier query results and compare this flat list against those of the gold standard. The scoring can be weighted so that each term's significance_score adds to the total. The relaxation score for a result set is therefore:

  sumOfVariantQueryTermScoresFoundInGold / sumOfGoldQueryTermScores
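
As a sketch, this might be implemented as below. Index and field names are illustrative, and crediting the gold-standard score for each overlapping term is one plausible reading of the formula above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def significant_scores(query, index="news", field="body"):
    """Return a {term: significance_score} dict for a query's results."""
    resp = es.search(index=index, body={
        "query": query,
        "size": 0,
        "aggs": {"sample": {
            "sampler": {"shard_size": 200},
            "aggs": {"keywords": {"significant_text": {"field": field}}},
        }},
    })
    return {b["key"]: b["score"] for b in
            resp["aggregations"]["sample"]["keywords"]["buckets"]}

def relaxation_score(gold_query, variant_query):
    gold = significant_scores(gold_query)
    variant = significant_scores(variant_query)
    # Credit the gold-standard score of every gold term that the sloppier
    # variant also surfaces, normalised by the total gold score.
    found = sum(gold[t] for t in variant if t in gold)
    total = sum(gold.values())
    return found / total if total else 0.0
```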

So, using our "bump stocks" example, the gold results contain the word recoil with a significance score of 913, while the sloppy variant "bump OR stocks" has no such term and is instead dominated by words like nasdaq, which relate to stocks in companies. The score for this relaxation is a terrible zero. I found most phrases score measurably poorly when fully relaxed to an OR. An example of a query that can be relaxed is "metoo movement" - the word "movement" can be made optional, and we see the score is still comparable to gold but the recall is much higher: twice the number of docs are matched.

Considering the garbage in the long tail

Note that when queries match a large number of docs it is important that we don't focus the significant_text aggregation on a sample consisting of just the top-matching docs. I was examining samples of 200 docs, and if we relax the query "bump stocks" to bump OR stocks we would still expect the 200 top-matching docs to consist entirely of documents that contain both terms as a phrase - Lucene is too good here. To get a feel for what is lurking in what might be a long tail of garbage results, we need to use a function_score query with a random scorer to sample documents from the entire result set. Also note that a good way to illustrate why relaxing a particular query phrase would be a bad choice is to examine the significant terms in a randomly scored sample of the sloppy results minus the gold, i.e.

 (bump OR stocks) NOT "bump stocks"

This reveals the main culprits behind any misunderstanding of a relaxed query - in this case: stock market, fist bump etc.
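
A sketch of that diagnostic request (index and field names are again illustrative assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

culprits = es.search(index="news", body={
    "query": {
        "function_score": {
            # (bump OR stocks) NOT "bump stocks"
            "query": {"bool": {
                "must": [{"match": {"body": "bump stocks"}}],  # OR by default
                "must_not": [{"match_phrase": {"body": "bump stocks"}}],
            }},
            # random scoring defeats Lucene's ranking so the sampler sees
            # the long tail rather than just the best matches
            "functions": [{"random_score": {}}],
            "boost_mode": "replace",
        }
    },
    "size": 0,
    "aggs": {"sample": {
        "sampler": {"shard_size": 200},
        "aggs": {"culprits": {"significant_text": {"field": "body"}}},
    }},
})
```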

Comparing temporal rhythms

Note that another potentially useful measure of result-set similarity in event reporting is to compare the timelines of results, as in the two timeline images shown in my first comment above. A Pearson correlation of the two timelines can give a measure that helps establish whether a variant of a query is following the same reporting cycle as the original query. Terms that don't co-appear in many docs but genuinely relate to the same concept (e.g. steven spielberg and stephen spielberg) can be suggested if they share a common news cycle.
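
A sketch of that timeline comparison - the "published" date field is an illustrative assumption:

```python
from math import sqrt
from elasticsearch import Elasticsearch

es = Elasticsearch()

def timeline(query, index="news", date_field="published"):
    resp = es.search(index=index, body={
        "query": query,
        "size": 0,
        # "interval" on 6.x-era clusters; newer versions use "calendar_interval"
        "aggs": {"over_time": {"date_histogram":
                               {"field": date_field, "interval": "day"}}},
    })
    return {b["key"]: b["doc_count"]
            for b in resp["aggregations"]["over_time"]["buckets"]}

def pearson(a, b):
    # Align the two timelines on the union of their date buckets.
    keys = sorted(set(a) | set(b))
    xs = [a.get(k, 0) for k in keys]
    ys = [b.get(k, 0) for k in keys]
    n = len(keys)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```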

Alternative query expansion approach - term expansion

Two-word shingles have proven to be incredibly useful in the significant_text aggregation - they improve the utility and readability of suggestions. Names of people like jeremiah cottle just pop out of the search results. However, there is a tension between the readability of these indexed terms and their mangling during indexing. If we index with a stemmer then we end up with confusing tokens like gun accessori rather than gun accessory or gun accessories. I chose not to stem in my tests to keep readability, but it does mean a common end-user task is to search for "bump stock" OR "bump stocks". I think an automated query rewriter can help with this routine task - the significant text results often contain singular and plural forms, and a simple pattern-matching rule based on common prefix lengths can add these variations into an expanded query. This works well because the term that needs pluralising in a significant shingle is often the last word, since shingles are often of the form [adjective] [noun], e.g. nuclear war(s). Additionally, this approach overcomes a limitation of stemming, which fails to normalize common forms of expansion, e.g. nationalist, nationalism, nationalists, or nuclear wars and nuclear warfare. By only making expansions selected from a set of terms statistically proven to be correlated we can have high confidence in these expansions (and, if required, prove their effectiveness using the techniques described above).
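
A sketch of that prefix rule - the minimum common-prefix length of 4 is an illustrative assumption:

```python
from os.path import commonprefix

MIN_PREFIX = 4  # illustrative threshold for "looks like the same stem"

def expansions(term, significant_terms):
    """Find significant terms that look like inflections of `term`."""
    head, _, last = term.rpartition(" ")
    variants = []
    for other in significant_terms:
        o_head, _, o_last = other.rpartition(" ")
        if other == term or o_head != head:
            continue
        # same leading words; last word shares a long common prefix,
        # e.g. "bump stock" / "bump stocks", "nationalism" / "nationalist"
        if len(commonprefix([last, o_last])) >= MIN_PREFIX:
            variants.append(other)
    return variants
```

So expansions("bump stock", sig_terms) would surface "bump stocks", which can then be ORed into the rewritten query.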

Conclusions

Most two-word searches measurably lose all meaning when relaxed to an OR. This is especially true of name searches, e.g. shaun white. We can seek out and measure relaxations automatically using the techniques described above, but it is expensive to do in real time. A simple rule based on matching doc counts is likely to suffice, where if the exact-phrase match produces fewer than N results the query is rewritten to an AND, then an OR, then a fuzzy match. However, automated query expansion using significant terms that share a common prefix looks like it would be a useful technique where stemming is avoided.

wchaparro commented 3 months ago

This is not something we plan to implement in the near future, and it has been superseded by our focus on ESQL. Closing as not planned. If you feel strongly about this one, please let us know.