apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.73k stars 1.05k forks source link

BooleanWeight should decide how to execute minNrShouldMatch [LUCENE-4872] #5937

Open asfimport opened 11 years ago

asfimport commented 11 years ago

5637 adds a dedicated document-at-time scorer for minNrShouldMatch which can use advance() behind the scenes.

In cases where you have some really common terms and some rare ones this can be a huge performance improvement.

On the other hand BooleanScorer might still be faster in some cases.

We should think about what the logic should be here: one simple thing to do is to always use the new scorer when minShouldMatch is set: thats where i'm leaning.

But maybe we could have a smarter heuristic too, perhaps based on cost()


Migrated from LUCENE-4872 by Robert Muir (@rmuir), 1 vote, updated May 09 2016 Parent: #5637 Attachments: crazyMinShouldMatch.tasks

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

To really do this right I think we need a better tasks file for luceneutil probably too.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I generated some random biggish minShouldMatch queries (attached), and then ran luceneutil perf test where base=trunk (always use BS2 if minShouldMatch > 1) and comp= always use BS1:

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
      20Terms15High20MSM      273.33      (2.5%)        0.87      (0.0%)  -99.7% ( -99% -  -99%)
      25Terms16High25MSM      328.44      (3.1%)        1.41      (0.0%)  -99.6% ( -99% -  -99%)
       15Terms7High15MSM      378.55      (2.5%)        1.66      (0.0%)  -99.6% ( -99% -  -99%)
      15Terms12High15MSM      437.86      (3.1%)        1.97      (0.0%)  -99.6% ( -99% -  -99%)
       25Terms8High25MSM      304.31      (3.0%)        1.46      (0.0%)  -99.5% ( -99% -  -99%)
       10Terms7High10MSM      560.31      (2.2%)        3.35      (0.0%)  -99.4% ( -99% -  -99%)
      15Terms11High15MSM      529.42      (3.0%)        3.98      (0.0%)  -99.2% ( -99% -  -99%)
      20Terms11High19MSM      214.71      (4.7%)        2.22      (0.0%)  -99.0% ( -99% -  -98%)
      20Terms15High19MSM      138.97      (5.7%)        1.46      (0.1%)  -98.9% ( -99% -  -98%)
      20Terms10High19MSM      377.77      (3.3%)        4.59      (0.1%)  -98.8% ( -98% -  -98%)
      25Terms10High24MSM      101.62      (5.7%)        1.31      (0.1%)  -98.7% ( -98% -  -98%)
       20Terms8High19MSM      108.52      (5.9%)        1.55      (0.1%)  -98.6% ( -98% -  -98%)
         5Terms2High5MSM      383.99      (1.4%)        5.48      (0.1%)  -98.6% ( -98% -  -98%)
         5Terms3High5MSM      355.19      (1.7%)        5.14      (0.1%)  -98.6% ( -98% -  -98%)
      25Terms14High24MSM      234.80      (3.8%)        3.43      (0.1%)  -98.5% ( -98% -  -98%)
       10Terms4High10MSM      802.64      (2.4%)       11.76      (0.1%)  -98.5% ( -98% -  -98%)
      15Terms11High14MSM      252.84      (5.0%)        4.07      (0.1%)  -98.4% ( -98% -  -98%)
       15Terms8High14MSM      286.52      (4.8%)        4.94      (0.1%)  -98.3% ( -98% -  -98%)
      25Terms16High22MSM       83.95      (5.6%)        1.46      (0.1%)  -98.3% ( -98% -  -98%)
      15Terms10High13MSM      200.36      (4.9%)        4.72      (0.1%)  -97.6% ( -97% -  -97%)
       25Terms2High25MSM      400.74      (2.9%)        9.84      (0.1%)  -97.5% ( -97% -  -97%)
       10Terms1High10MSM      443.03      (1.9%)       11.48      (0.1%)  -97.4% ( -97% -  -97%)
       25Terms7High23MSM      119.41      (4.9%)        3.38      (0.1%)  -97.2% ( -97% -  -96%)
         5Terms1High5MSM      791.03      (1.5%)       22.66      (0.1%)  -97.1% ( -97% -  -96%)
       25Terms6High22MSM       65.43      (6.5%)        2.16      (0.2%)  -96.7% ( -97% -  -96%)
       15Terms8High11MSM       49.38      (7.8%)        1.67      (0.2%)  -96.6% ( -97% -  -96%)
       15Terms7High12MSM       87.96      (5.9%)        3.55      (0.2%)  -96.0% ( -96% -  -95%)
      25Terms10High18MSM       55.03      (6.6%)        2.30      (0.2%)  -95.8% ( -96% -  -95%)
        10Terms4High8MSM      133.95      (5.9%)        5.67      (0.2%)  -95.8% ( -96% -  -95%)
        10Terms5High7MSM       62.52      (7.0%)        2.71      (0.2%)  -95.7% ( -96% -  -95%)
        10Terms3High8MSM       76.74      (6.8%)        3.51      (0.2%)  -95.4% ( -95% -  -94%)
       15Terms7High11MSM       73.30      (6.3%)        3.43      (0.2%)  -95.3% ( -95% -  -94%)
         5Terms4High5MSM      310.05      (2.4%)       14.54      (0.2%)  -95.3% ( -95% -  -94%)
       15Terms3High13MSM      256.05      (4.5%)       12.36      (0.2%)  -95.2% ( -95% -  -94%)
       25Terms9High19MSM      103.25      (5.3%)        5.02      (0.2%)  -95.1% ( -95% -  -94%)
       25Terms6High18MSM       36.26      (8.2%)        2.17      (0.3%)  -94.0% ( -94% -  -93%)
      25Terms14High15MSM       27.13      (5.4%)        1.70      (0.3%)  -93.7% ( -94% -  -93%)
        15Terms4High9MSM       43.14      (8.0%)        2.70      (0.3%)  -93.7% ( -94% -  -92%)
         5Terms2High4MSM       82.68      (7.1%)        5.65      (0.3%)  -93.2% ( -93% -  -92%)
        15Terms6High9MSM       55.67      (7.0%)        4.01      (0.3%)  -92.8% ( -93% -  -91%)
       15Terms3High12MSM       62.05      (7.0%)        4.51      (0.3%)  -92.7% ( -93% -  -91%)
       20Terms0High20MSM      597.72      (1.7%)       43.89      (0.2%)  -92.7% ( -92% -  -92%)
        10Terms3High6MSM       56.10      (7.5%)        4.13      (0.3%)  -92.6% ( -93% -  -91%)
       20Terms6High13MSM       55.80      (6.8%)        4.14      (0.3%)  -92.6% ( -93% -  -91%)
        10Terms4High6MSM       72.64      (6.5%)        5.95      (0.4%)  -91.8% ( -92% -  -90%)
       20Terms6High12MSM       51.01      (6.9%)        4.28      (0.4%)  -91.6% ( -92% -  -90%)
       20Terms8High10MSM       17.80      (7.4%)        1.53      (0.4%)  -91.4% ( -92% -  -90%)
       20Terms5High14MSM       87.56      (5.5%)        7.74      (0.4%)  -91.2% ( -92% -  -90%)
       15Terms2High13MSM      119.82      (5.5%)       11.57      (0.4%)  -90.3% ( -91% -  -89%)
       15Terms0High15MSM      548.28      (2.9%)       56.01      (0.3%)  -89.8% ( -90% -  -89%)
       20Terms3High15MSM      112.66      (5.2%)       11.93      (0.5%)  -89.4% ( -90% -  -88%)
      20Terms17High17MSM       23.13      (8.8%)        2.55      (0.5%)  -89.0% ( -90% -  -87%)
       20Terms3High13MSM       35.37      (8.4%)        4.09      (0.5%)  -88.4% ( -89% -  -86%)
       15Terms1High13MSM      204.25      (4.7%)       26.27      (0.5%)  -87.1% ( -88% -  -85%)
       25Terms9High10MSM       33.16      (5.3%)        4.59      (0.6%)  -86.2% ( -87% -  -84%)
       15Terms2High11MSM       89.91      (5.8%)       12.49      (0.6%)  -86.1% ( -87% -  -84%)
         5Terms3High4MSM       30.54      (3.8%)        4.35      (0.7%)  -85.8% ( -86% -  -84%)
         5Terms0High5MSM      843.58      (1.8%)      121.72      (0.6%)  -85.6% ( -86% -  -84%)
         5Terms1High4MSM      292.58      (5.9%)       45.87      (0.7%)  -84.3% ( -85% -  -82%)
        10Terms4High5MSM       37.67      (4.9%)        5.94      (0.8%)  -84.2% ( -85% -  -82%)
      15Terms12High12MSM       11.43      (9.4%)        1.90      (0.8%)  -83.4% ( -85% -  -80%)
        15Terms3High6MSM       43.32      (7.4%)        7.50      (0.8%)  -82.7% ( -84% -  -80%)
      15Terms10High10MSM       12.00      (8.4%)        2.14      (0.8%)  -82.2% ( -84% -  -79%)
        10Terms2High5MSM       41.81      (8.2%)        7.54      (0.9%)  -82.0% ( -84% -  -79%)
        15Terms2High7MSM       31.31      (8.4%)        5.83      (0.9%)  -81.4% ( -83% -  -78%)
        20Terms3High9MSM       61.79      (6.0%)       12.05      (0.8%)  -80.5% ( -82% -  -78%)
        10Terms1High6MSM       55.03      (7.8%)       11.26      (1.0%)  -79.5% ( -81% -  -76%)
         5Terms1High3MSM       73.74      (6.4%)       15.42      (1.0%)  -79.1% ( -81% -  -76%)
        15Terms2High6MSM       44.95      (7.2%)       10.28      (1.1%)  -77.1% ( -79% -  -74%)
         5Terms2High3MSM       24.52      (3.8%)        5.78      (1.1%)  -76.4% ( -78% -  -74%)
       20Terms1High13MSM       58.74      (6.4%)       16.13      (1.2%)  -72.5% ( -75% -  -69%)
        25Terms4High7MSM       37.41      (6.6%)       10.40      (1.2%)  -72.2% ( -75% -  -68%)
        15Terms1High9MSM       60.48      (6.8%)       17.85      (1.3%)  -70.5% ( -73% -  -66%)
        15Terms1High5MSM       22.75      (8.5%)        7.22      (1.5%)  -68.3% ( -72% -  -63%)
        20Terms2High8MSM       18.82      (8.2%)        6.33      (1.5%)  -66.4% ( -70% -  -61%)
       20Terms1High11MSM       50.55      (7.0%)       19.27      (1.6%)  -61.9% ( -65% -  -57%)
        10Terms6High6MSM        4.99      (7.3%)        2.02      (2.0%)  -59.4% ( -64% -  -53%)
         5Terms1High2MSM       73.14      (3.2%)       31.07      (2.1%)  -57.5% ( -60% -  -53%)
        20Terms4High5MSM        8.08      (5.7%)        3.47      (2.0%)  -57.0% ( -61% -  -52%)
        10Terms2High3MSM       14.55      (5.1%)        6.40      (2.1%)  -56.0% ( -60% -  -51%)
       15Terms0High11MSM      177.51      (5.1%)       78.53      (1.4%)  -55.8% ( -59% -  -51%)
        15Terms9High8MSM        9.83      (9.1%)        4.72      (2.3%)  -51.9% ( -58% -  -44%)
      20Terms13High11MSM        7.18      (9.0%)        3.51      (2.4%)  -51.1% ( -57% -  -43%)
        10Terms0High8MSM      142.40      (5.4%)       78.02      (1.9%)  -45.2% ( -49% -  -39%)
        10Terms4High4MSM       19.21      (8.9%)       11.12      (2.8%)  -42.1% ( -49% -  -33%)
        15Terms1High4MSM       19.08      (8.4%)       11.44      (2.8%)  -40.0% ( -47% -  -31%)
         5Terms0High4MSM      207.03      (5.9%)      124.76      (2.5%)  -39.7% ( -45% -  -33%)
        15Terms7High6MSM        5.02      (8.7%)        3.14      (3.0%)  -37.5% ( -45% -  -28%)
      20Terms15High12MSM        1.44      (7.1%)        0.90      (3.6%)  -37.2% ( -44% -  -28%)
        10Terms9High7MSM        8.39      (9.1%)        5.40      (3.1%)  -35.7% ( -43% -  -25%)
      15Terms13High10MSM        2.70      (8.5%)        1.75      (3.0%)  -35.2% ( -43% -  -25%)
       20Terms0High14MSM       42.75      (8.0%)       28.29      (2.4%)  -33.8% ( -40% -  -25%)
      25Terms21High16MSM        1.68      (7.6%)        1.12      (3.4%)  -33.4% ( -41% -  -24%)
        20Terms1High4MSM       45.87      (6.2%)       31.85      (2.6%)  -30.6% ( -37% -  -23%)
         5Terms4High4MSM        4.03      (8.6%)        2.93      (3.5%)  -27.2% ( -36% -  -16%)
         5Terms0High3MSM      211.23      (5.6%)      172.13      (3.4%)  -18.5% ( -26% -  -10%)
         5Terms3High3MSM        9.27      (9.8%)        7.81      (4.2%)  -15.7% ( -26% -   -1%)
      15Terms14High10MSM        1.11      (7.0%)        0.97      (5.0%)  -12.9% ( -23% -    0%)
      20Terms17High11MSM        2.96      (8.4%)        2.71      (4.3%)   -8.5% ( -19% -    4%)
        20Terms3High3MSM       14.55      (7.9%)       13.45      (4.1%)   -7.6% ( -18% -    4%)
        10Terms0High6MSM       78.35      (6.4%)       73.87      (3.3%)   -5.7% ( -14% -    4%)
        15Terms0High8MSM       56.41      (6.8%)       53.67      (3.1%)   -4.8% ( -13% -    5%)
      25Terms14High10MSM        0.95      (6.6%)        0.93      (5.4%)   -2.3% ( -13% -   10%)
       25Terms11High5MSM      623.50      (1.6%)      620.72      (1.4%)   -0.4% (  -3% -    2%)
       25Terms14High9MSM        2.99      (8.2%)        2.99      (4.7%)    0.1% ( -11% -   14%)
      20Terms17High10MSM        2.60      (8.3%)        2.67      (4.9%)    3.1% (  -9% -   17%)
        15Terms0High6MSM       70.68      (6.0%)       75.03      (3.6%)    6.2% (  -3% -   16%)
       20Terms13High9MSM        1.02      (6.9%)        1.08      (5.6%)    6.2% (  -5% -   20%)
      25Terms22High14MSM        1.02      (6.9%)        1.09      (5.7%)    6.8% (  -5% -   20%)
        15Terms5High4MSM        4.38      (8.4%)        4.69      (5.1%)    7.1% (  -5% -   22%)
        25Terms0High8MSM       42.57      (6.4%)       48.39      (3.1%)   13.7% (   3% -   24%)
        15Terms0High5MSM       64.99      (6.0%)       76.17      (4.0%)   17.2% (   6% -   28%)
       20Terms12High7MSM        1.59      (7.6%)        1.86      (5.5%)   17.3% (   3% -   32%)
       15Terms14High9MSM        0.78      (6.6%)        0.93      (6.9%)   20.2% (   6% -   36%)
        10Terms8High5MSM        5.14      (9.0%)        6.21      (5.9%)   20.9% (   5% -   39%)
      25Terms20High13MSM        0.58      (5.9%)        0.71      (7.5%)   21.6% (   7% -   37%)
         5Terms4High3MSM        8.61      (9.8%)       10.60      (6.1%)   23.2% (   6% -   43%)
         5Terms2High2MSM       11.50     (10.1%)       14.73      (6.2%)   28.1% (  10% -   49%)
        10Terms0High4MSM       36.94      (8.7%)       47.68      (4.8%)   29.1% (  14% -   46%)
      25Terms22High12MSM        0.82      (6.5%)        1.13      (7.1%)   38.3% (  23% -   55%)
        10Terms4High3MSM        3.38      (8.6%)        4.68      (6.6%)   38.7% (  21% -   58%)
        10Terms7High4MSM        1.25      (7.4%)        1.75      (6.6%)   40.5% (  24% -   58%)
         5Terms0High2MSM       60.10      (8.1%)       88.43      (5.2%)   47.1% (  31% -   65%)
        10Terms0High2MSM       47.57      (7.6%)       72.23      (5.1%)   51.9% (  36% -   69%)
        15Terms5High3MSM        2.52      (8.2%)        3.96      (7.4%)   56.8% (  38% -   78%)
         5Terms3High2MSM        4.25      (9.7%)        6.67      (7.5%)   57.0% (  36% -   82%)
       15Terms11High5MSM        2.63      (8.0%)        4.31      (7.7%)   63.7% (  44% -   86%)
        20Terms9High5MSM        0.87      (6.4%)        1.45      (8.7%)   67.7% (  49% -   88%)
        15Terms9High5MSM        0.82      (6.7%)        1.38      (8.1%)   68.6% (  50% -   89%)
         5Terms4High2MSM        5.49     (10.0%)       10.06      (8.9%)   83.4% (  58% -  113%)
      25Terms23High12MSM        0.35      (5.4%)        0.65     (11.6%)   84.2% (  63% -  106%)
        10Terms8High3MSM        1.29      (7.7%)        2.61      (9.6%)  102.5% (  79% -  129%)
        10Terms5High2MSM        5.07      (9.2%)       10.46      (9.8%)  106.3% (  79% -  138%)
        10Terms9High3MSM        1.12      (7.5%)        2.35      (9.8%)  110.8% (  87% -  138%)
        15Terms8High2MSM        2.56      (8.1%)        5.66     (10.0%)  121.5% (  95% -  151%)
        10Terms9High2MSM        2.07      (7.9%)        4.71     (10.4%)  127.5% ( 101% -  158%)
        15Terms9High2MSM        0.98      (6.7%)        2.26     (10.1%)  131.5% ( 107% -  158%)
       20Terms17High6MSM        0.31      (5.4%)        0.79     (14.8%)  151.9% ( 125% -  181%)
       25Terms12High2MSM        0.68      (5.8%)        1.73     (11.4%)  154.3% ( 129% -  182%)
       25Terms23High7MSM        0.40      (5.4%)        1.02     (13.3%)  155.3% ( 129% -  184%)
       25Terms22High5MSM        0.81      (6.2%)        2.12     (12.1%)  159.8% ( 133% -  190%)
       25Terms24High5MSM        0.32      (5.2%)        0.96     (15.7%)  203.4% ( 173% -  236%)
       25Terms22High2MSM        0.31      (4.6%)        0.97     (15.1%)  215.5% ( 187% -  246%)

Each query draws from low and high freq terms, and eg 25Terms22High14MSM means there are 25 terms int he query, 22 of which are high freq, and minShouldMatch is 14.

Net/net the lower the minShouldMatch, especially vs the number of high-freq terms, the better BS1 is. But BS2 kicks butt otherwise!

asfimport commented 11 years ago

Stefan Pohl (migrated from JIRA)

Thanks, Mike, this behaves as expected. Now we have a sense of what trade-off we'd be going for if we agree on the current model, it is still a hard decision though, entailing questions like:

A few suggestions for a better model which maybe goes beyond the scope of this ticket:

A very conservative usage rule for MSMSumScorer would be to use it only if the constraint is at least one higher than the number of high-freq terms, then it will always "kick butt" and we'd get most bang of this scorer without having slow-downs. But we'd miss out on many cases where it would be faster and those might be the ones that are used in practice by users, and it is not clear (to me:-) what 'high-freq' means. If at all, this should be seen relative to the highest-freq subclause.

More generally, it seems to me the problem we're trying to solve here is identical to computing a cost. If the cost returned by Scorers correlates with execution time, then we could simply call the cost() method on BS and MSMSumScorer and use MSMSumScorer if it is significantly below the former (assuming there are no side-effects in doing these calls). So we'd defer the problem to the individual Scorers, which splits the problem up into smaller subproblems and the Scorers know themselves best about their implementation and behavior.

To make accurate decisions, we probably have to extend the cost-API to return more detailed information to base decision rules on, e.g. upper bound, lower bound (to be able to make conservative/speculative decisions) and estimate the number of returned docs and runtime-correlated cost (in some unit). For instance, MSMSumScorer's overall cost depends on both of the latter and can be split up into the following 2 stages:

  1. Candidate generation = heap-based merge of clause subset, i.e. the same as for DisjSumScorer, but on a clause subset: time to generate all docs from subScorer: correlates with sum over costs of #clauses-(mm-1) least-costly subScorers Number of candidates: [max(...), min(sum(...), maxdoc)], where ... can be either an upper bound, lower bound or an estimate in between of the #candidates returned by the #clauses-(mm-1) subScorers Even for TermScorer, the definition of these two measures are not identical due to the min(..., maxdoc).
  2. Full scoring of candidates: time to advance() and decode postings: (mm-1) * # candidates

The costs would still have to be weighted by the relative overhead of 1) heap-merging, 2) advance() + early-stopping; not sure, if constants are enough here.

While the scope of this topic seems large (modelling all scorers), I currently don't see a simpler way to make this reliably work for arbitrarily structured queries, think of MSM(Disj(MSM(Conj(...),...),...),subtree2,...).

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I really don't really know what the typical/common use cases are for minShouldMatch.

I agree we should err towards BS2, since it can be insanely faster while BS1 can only be \~3X faster (on super-slow queries to begin with), in this test anyway.

A more accurate cost model for scorers would be awesome! This could be a general framework that we'd be able to use for various forms for query optimizing (which we don't do today or do with heuristics), eg things like whether to apply a filter (AND) high vs low, whether to use BS1 or BS2 for pure conjunctions, when to split a PhraseQuery into conjunction + position checking, flattening of nested boolean queries, MultiTermQuery rewrite method, etc. But probably we should explore this on a new issue.

asfimport commented 11 years ago

Stefan Pohl (migrated from JIRA)

I really don't really know what the typical/common use cases are for minShouldMatch.

What about your own great work (http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html) as a use-case to start with? Maybe some consulting committers can also share some insight on how this is used in the wild.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What about your own great work (http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html) as a use-case to start with?

Thank Stefan :) That's sort of a specialized (but very useful) use case I think ... and the minShouldMatch is always N-1.

Maybe some consulting committers can also share some insight on how this is used in the wild.

+1, that'd be great to know!

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I often use min_should_match in practice. Like one example is if you do search for titles or name like POI's or meta-data. Lets take youtube as an example you often get queries like "queen wembley live 1989" which was in-fact 1986 (at least the one I meant here) a pretty good pattern is to use some metric like 80% must match if >= 2 query terms etc. Another good example is if you use shingles a query like "queen wembley live 1989" produces lots of terms and "wembley live" might be pretty common so you want to make sure that you are not returning stuff from other band but on the other hand a pure conjunction is not acceptable here either.

hope that give some insight?

asfimport commented 11 years ago

Eks Dev (migrated from JIRA)

the same pattern like Simon here, just having these terms wrapped in fuzzy/prefix query, often as dismax query.

for example: BQ(boo* OR hoo* OR whatever) with e.g. minShouldMatch = 2

So the only diff to Simon's case is that single boolean clauses are often more complicated then simple TermQuery

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I really don't really know what the typical/common use cases are for minShouldMatch.

One very practical thing is that solr queryparsers (probably elasticsearch has similar ones too?) such as dismax/edismax actually seem to be fully defined in terms of minShouldMatch (with the extremes being handled as OR and AND).

I know Tom Burton-West has experimented with this some on chinese TREC data (he has some comments on SOLR-3589), etc.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.