Open asfimport opened 11 years ago
Robert Muir (@rmuir) (migrated from JIRA)
To really do this right I think we need a better tasks file for luceneutil probably too.
Michael McCandless (@mikemccand) (migrated from JIRA)
I generated some random biggish minShouldMatch queries (attached), and then ran luceneutil perf test where base=trunk (always use BS2 if minShouldMatch > 1) and comp= always use BS1:
Task QPS base StdDev QPS comp StdDev Pct diff
20Terms15High20MSM 273.33 (2.5%) 0.87 (0.0%) -99.7% ( -99% - -99%)
25Terms16High25MSM 328.44 (3.1%) 1.41 (0.0%) -99.6% ( -99% - -99%)
15Terms7High15MSM 378.55 (2.5%) 1.66 (0.0%) -99.6% ( -99% - -99%)
15Terms12High15MSM 437.86 (3.1%) 1.97 (0.0%) -99.6% ( -99% - -99%)
25Terms8High25MSM 304.31 (3.0%) 1.46 (0.0%) -99.5% ( -99% - -99%)
10Terms7High10MSM 560.31 (2.2%) 3.35 (0.0%) -99.4% ( -99% - -99%)
15Terms11High15MSM 529.42 (3.0%) 3.98 (0.0%) -99.2% ( -99% - -99%)
20Terms11High19MSM 214.71 (4.7%) 2.22 (0.0%) -99.0% ( -99% - -98%)
20Terms15High19MSM 138.97 (5.7%) 1.46 (0.1%) -98.9% ( -99% - -98%)
20Terms10High19MSM 377.77 (3.3%) 4.59 (0.1%) -98.8% ( -98% - -98%)
25Terms10High24MSM 101.62 (5.7%) 1.31 (0.1%) -98.7% ( -98% - -98%)
20Terms8High19MSM 108.52 (5.9%) 1.55 (0.1%) -98.6% ( -98% - -98%)
5Terms2High5MSM 383.99 (1.4%) 5.48 (0.1%) -98.6% ( -98% - -98%)
5Terms3High5MSM 355.19 (1.7%) 5.14 (0.1%) -98.6% ( -98% - -98%)
25Terms14High24MSM 234.80 (3.8%) 3.43 (0.1%) -98.5% ( -98% - -98%)
10Terms4High10MSM 802.64 (2.4%) 11.76 (0.1%) -98.5% ( -98% - -98%)
15Terms11High14MSM 252.84 (5.0%) 4.07 (0.1%) -98.4% ( -98% - -98%)
15Terms8High14MSM 286.52 (4.8%) 4.94 (0.1%) -98.3% ( -98% - -98%)
25Terms16High22MSM 83.95 (5.6%) 1.46 (0.1%) -98.3% ( -98% - -98%)
15Terms10High13MSM 200.36 (4.9%) 4.72 (0.1%) -97.6% ( -97% - -97%)
25Terms2High25MSM 400.74 (2.9%) 9.84 (0.1%) -97.5% ( -97% - -97%)
10Terms1High10MSM 443.03 (1.9%) 11.48 (0.1%) -97.4% ( -97% - -97%)
25Terms7High23MSM 119.41 (4.9%) 3.38 (0.1%) -97.2% ( -97% - -96%)
5Terms1High5MSM 791.03 (1.5%) 22.66 (0.1%) -97.1% ( -97% - -96%)
25Terms6High22MSM 65.43 (6.5%) 2.16 (0.2%) -96.7% ( -97% - -96%)
15Terms8High11MSM 49.38 (7.8%) 1.67 (0.2%) -96.6% ( -97% - -96%)
15Terms7High12MSM 87.96 (5.9%) 3.55 (0.2%) -96.0% ( -96% - -95%)
25Terms10High18MSM 55.03 (6.6%) 2.30 (0.2%) -95.8% ( -96% - -95%)
10Terms4High8MSM 133.95 (5.9%) 5.67 (0.2%) -95.8% ( -96% - -95%)
10Terms5High7MSM 62.52 (7.0%) 2.71 (0.2%) -95.7% ( -96% - -95%)
10Terms3High8MSM 76.74 (6.8%) 3.51 (0.2%) -95.4% ( -95% - -94%)
15Terms7High11MSM 73.30 (6.3%) 3.43 (0.2%) -95.3% ( -95% - -94%)
5Terms4High5MSM 310.05 (2.4%) 14.54 (0.2%) -95.3% ( -95% - -94%)
15Terms3High13MSM 256.05 (4.5%) 12.36 (0.2%) -95.2% ( -95% - -94%)
25Terms9High19MSM 103.25 (5.3%) 5.02 (0.2%) -95.1% ( -95% - -94%)
25Terms6High18MSM 36.26 (8.2%) 2.17 (0.3%) -94.0% ( -94% - -93%)
25Terms14High15MSM 27.13 (5.4%) 1.70 (0.3%) -93.7% ( -94% - -93%)
15Terms4High9MSM 43.14 (8.0%) 2.70 (0.3%) -93.7% ( -94% - -92%)
5Terms2High4MSM 82.68 (7.1%) 5.65 (0.3%) -93.2% ( -93% - -92%)
15Terms6High9MSM 55.67 (7.0%) 4.01 (0.3%) -92.8% ( -93% - -91%)
15Terms3High12MSM 62.05 (7.0%) 4.51 (0.3%) -92.7% ( -93% - -91%)
20Terms0High20MSM 597.72 (1.7%) 43.89 (0.2%) -92.7% ( -92% - -92%)
10Terms3High6MSM 56.10 (7.5%) 4.13 (0.3%) -92.6% ( -93% - -91%)
20Terms6High13MSM 55.80 (6.8%) 4.14 (0.3%) -92.6% ( -93% - -91%)
10Terms4High6MSM 72.64 (6.5%) 5.95 (0.4%) -91.8% ( -92% - -90%)
20Terms6High12MSM 51.01 (6.9%) 4.28 (0.4%) -91.6% ( -92% - -90%)
20Terms8High10MSM 17.80 (7.4%) 1.53 (0.4%) -91.4% ( -92% - -90%)
20Terms5High14MSM 87.56 (5.5%) 7.74 (0.4%) -91.2% ( -92% - -90%)
15Terms2High13MSM 119.82 (5.5%) 11.57 (0.4%) -90.3% ( -91% - -89%)
15Terms0High15MSM 548.28 (2.9%) 56.01 (0.3%) -89.8% ( -90% - -89%)
20Terms3High15MSM 112.66 (5.2%) 11.93 (0.5%) -89.4% ( -90% - -88%)
20Terms17High17MSM 23.13 (8.8%) 2.55 (0.5%) -89.0% ( -90% - -87%)
20Terms3High13MSM 35.37 (8.4%) 4.09 (0.5%) -88.4% ( -89% - -86%)
15Terms1High13MSM 204.25 (4.7%) 26.27 (0.5%) -87.1% ( -88% - -85%)
25Terms9High10MSM 33.16 (5.3%) 4.59 (0.6%) -86.2% ( -87% - -84%)
15Terms2High11MSM 89.91 (5.8%) 12.49 (0.6%) -86.1% ( -87% - -84%)
5Terms3High4MSM 30.54 (3.8%) 4.35 (0.7%) -85.8% ( -86% - -84%)
5Terms0High5MSM 843.58 (1.8%) 121.72 (0.6%) -85.6% ( -86% - -84%)
5Terms1High4MSM 292.58 (5.9%) 45.87 (0.7%) -84.3% ( -85% - -82%)
10Terms4High5MSM 37.67 (4.9%) 5.94 (0.8%) -84.2% ( -85% - -82%)
15Terms12High12MSM 11.43 (9.4%) 1.90 (0.8%) -83.4% ( -85% - -80%)
15Terms3High6MSM 43.32 (7.4%) 7.50 (0.8%) -82.7% ( -84% - -80%)
15Terms10High10MSM 12.00 (8.4%) 2.14 (0.8%) -82.2% ( -84% - -79%)
10Terms2High5MSM 41.81 (8.2%) 7.54 (0.9%) -82.0% ( -84% - -79%)
15Terms2High7MSM 31.31 (8.4%) 5.83 (0.9%) -81.4% ( -83% - -78%)
20Terms3High9MSM 61.79 (6.0%) 12.05 (0.8%) -80.5% ( -82% - -78%)
10Terms1High6MSM 55.03 (7.8%) 11.26 (1.0%) -79.5% ( -81% - -76%)
5Terms1High3MSM 73.74 (6.4%) 15.42 (1.0%) -79.1% ( -81% - -76%)
15Terms2High6MSM 44.95 (7.2%) 10.28 (1.1%) -77.1% ( -79% - -74%)
5Terms2High3MSM 24.52 (3.8%) 5.78 (1.1%) -76.4% ( -78% - -74%)
20Terms1High13MSM 58.74 (6.4%) 16.13 (1.2%) -72.5% ( -75% - -69%)
25Terms4High7MSM 37.41 (6.6%) 10.40 (1.2%) -72.2% ( -75% - -68%)
15Terms1High9MSM 60.48 (6.8%) 17.85 (1.3%) -70.5% ( -73% - -66%)
15Terms1High5MSM 22.75 (8.5%) 7.22 (1.5%) -68.3% ( -72% - -63%)
20Terms2High8MSM 18.82 (8.2%) 6.33 (1.5%) -66.4% ( -70% - -61%)
20Terms1High11MSM 50.55 (7.0%) 19.27 (1.6%) -61.9% ( -65% - -57%)
10Terms6High6MSM 4.99 (7.3%) 2.02 (2.0%) -59.4% ( -64% - -53%)
5Terms1High2MSM 73.14 (3.2%) 31.07 (2.1%) -57.5% ( -60% - -53%)
20Terms4High5MSM 8.08 (5.7%) 3.47 (2.0%) -57.0% ( -61% - -52%)
10Terms2High3MSM 14.55 (5.1%) 6.40 (2.1%) -56.0% ( -60% - -51%)
15Terms0High11MSM 177.51 (5.1%) 78.53 (1.4%) -55.8% ( -59% - -51%)
15Terms9High8MSM 9.83 (9.1%) 4.72 (2.3%) -51.9% ( -58% - -44%)
20Terms13High11MSM 7.18 (9.0%) 3.51 (2.4%) -51.1% ( -57% - -43%)
10Terms0High8MSM 142.40 (5.4%) 78.02 (1.9%) -45.2% ( -49% - -39%)
10Terms4High4MSM 19.21 (8.9%) 11.12 (2.8%) -42.1% ( -49% - -33%)
15Terms1High4MSM 19.08 (8.4%) 11.44 (2.8%) -40.0% ( -47% - -31%)
5Terms0High4MSM 207.03 (5.9%) 124.76 (2.5%) -39.7% ( -45% - -33%)
15Terms7High6MSM 5.02 (8.7%) 3.14 (3.0%) -37.5% ( -45% - -28%)
20Terms15High12MSM 1.44 (7.1%) 0.90 (3.6%) -37.2% ( -44% - -28%)
10Terms9High7MSM 8.39 (9.1%) 5.40 (3.1%) -35.7% ( -43% - -25%)
15Terms13High10MSM 2.70 (8.5%) 1.75 (3.0%) -35.2% ( -43% - -25%)
20Terms0High14MSM 42.75 (8.0%) 28.29 (2.4%) -33.8% ( -40% - -25%)
25Terms21High16MSM 1.68 (7.6%) 1.12 (3.4%) -33.4% ( -41% - -24%)
20Terms1High4MSM 45.87 (6.2%) 31.85 (2.6%) -30.6% ( -37% - -23%)
5Terms4High4MSM 4.03 (8.6%) 2.93 (3.5%) -27.2% ( -36% - -16%)
5Terms0High3MSM 211.23 (5.6%) 172.13 (3.4%) -18.5% ( -26% - -10%)
5Terms3High3MSM 9.27 (9.8%) 7.81 (4.2%) -15.7% ( -26% - -1%)
15Terms14High10MSM 1.11 (7.0%) 0.97 (5.0%) -12.9% ( -23% - 0%)
20Terms17High11MSM 2.96 (8.4%) 2.71 (4.3%) -8.5% ( -19% - 4%)
20Terms3High3MSM 14.55 (7.9%) 13.45 (4.1%) -7.6% ( -18% - 4%)
10Terms0High6MSM 78.35 (6.4%) 73.87 (3.3%) -5.7% ( -14% - 4%)
15Terms0High8MSM 56.41 (6.8%) 53.67 (3.1%) -4.8% ( -13% - 5%)
25Terms14High10MSM 0.95 (6.6%) 0.93 (5.4%) -2.3% ( -13% - 10%)
25Terms11High5MSM 623.50 (1.6%) 620.72 (1.4%) -0.4% ( -3% - 2%)
25Terms14High9MSM 2.99 (8.2%) 2.99 (4.7%) 0.1% ( -11% - 14%)
20Terms17High10MSM 2.60 (8.3%) 2.67 (4.9%) 3.1% ( -9% - 17%)
15Terms0High6MSM 70.68 (6.0%) 75.03 (3.6%) 6.2% ( -3% - 16%)
20Terms13High9MSM 1.02 (6.9%) 1.08 (5.6%) 6.2% ( -5% - 20%)
25Terms22High14MSM 1.02 (6.9%) 1.09 (5.7%) 6.8% ( -5% - 20%)
15Terms5High4MSM 4.38 (8.4%) 4.69 (5.1%) 7.1% ( -5% - 22%)
25Terms0High8MSM 42.57 (6.4%) 48.39 (3.1%) 13.7% ( 3% - 24%)
15Terms0High5MSM 64.99 (6.0%) 76.17 (4.0%) 17.2% ( 6% - 28%)
20Terms12High7MSM 1.59 (7.6%) 1.86 (5.5%) 17.3% ( 3% - 32%)
15Terms14High9MSM 0.78 (6.6%) 0.93 (6.9%) 20.2% ( 6% - 36%)
10Terms8High5MSM 5.14 (9.0%) 6.21 (5.9%) 20.9% ( 5% - 39%)
25Terms20High13MSM 0.58 (5.9%) 0.71 (7.5%) 21.6% ( 7% - 37%)
5Terms4High3MSM 8.61 (9.8%) 10.60 (6.1%) 23.2% ( 6% - 43%)
5Terms2High2MSM 11.50 (10.1%) 14.73 (6.2%) 28.1% ( 10% - 49%)
10Terms0High4MSM 36.94 (8.7%) 47.68 (4.8%) 29.1% ( 14% - 46%)
25Terms22High12MSM 0.82 (6.5%) 1.13 (7.1%) 38.3% ( 23% - 55%)
10Terms4High3MSM 3.38 (8.6%) 4.68 (6.6%) 38.7% ( 21% - 58%)
10Terms7High4MSM 1.25 (7.4%) 1.75 (6.6%) 40.5% ( 24% - 58%)
5Terms0High2MSM 60.10 (8.1%) 88.43 (5.2%) 47.1% ( 31% - 65%)
10Terms0High2MSM 47.57 (7.6%) 72.23 (5.1%) 51.9% ( 36% - 69%)
15Terms5High3MSM 2.52 (8.2%) 3.96 (7.4%) 56.8% ( 38% - 78%)
5Terms3High2MSM 4.25 (9.7%) 6.67 (7.5%) 57.0% ( 36% - 82%)
15Terms11High5MSM 2.63 (8.0%) 4.31 (7.7%) 63.7% ( 44% - 86%)
20Terms9High5MSM 0.87 (6.4%) 1.45 (8.7%) 67.7% ( 49% - 88%)
15Terms9High5MSM 0.82 (6.7%) 1.38 (8.1%) 68.6% ( 50% - 89%)
5Terms4High2MSM 5.49 (10.0%) 10.06 (8.9%) 83.4% ( 58% - 113%)
25Terms23High12MSM 0.35 (5.4%) 0.65 (11.6%) 84.2% ( 63% - 106%)
10Terms8High3MSM 1.29 (7.7%) 2.61 (9.6%) 102.5% ( 79% - 129%)
10Terms5High2MSM 5.07 (9.2%) 10.46 (9.8%) 106.3% ( 79% - 138%)
10Terms9High3MSM 1.12 (7.5%) 2.35 (9.8%) 110.8% ( 87% - 138%)
15Terms8High2MSM 2.56 (8.1%) 5.66 (10.0%) 121.5% ( 95% - 151%)
10Terms9High2MSM 2.07 (7.9%) 4.71 (10.4%) 127.5% ( 101% - 158%)
15Terms9High2MSM 0.98 (6.7%) 2.26 (10.1%) 131.5% ( 107% - 158%)
20Terms17High6MSM 0.31 (5.4%) 0.79 (14.8%) 151.9% ( 125% - 181%)
25Terms12High2MSM 0.68 (5.8%) 1.73 (11.4%) 154.3% ( 129% - 182%)
25Terms23High7MSM 0.40 (5.4%) 1.02 (13.3%) 155.3% ( 129% - 184%)
25Terms22High5MSM 0.81 (6.2%) 2.12 (12.1%) 159.8% ( 133% - 190%)
25Terms24High5MSM 0.32 (5.2%) 0.96 (15.7%) 203.4% ( 173% - 236%)
25Terms22High2MSM 0.31 (4.6%) 0.97 (15.1%) 215.5% ( 187% - 246%)
Each query draws from low and high freq terms, and eg 25Terms22High14MSM means there are 25 terms int he query, 22 of which are high freq, and minShouldMatch is 14.
Net/net the lower the minShouldMatch, especially vs the number of high-freq terms, the better BS1 is. But BS2 kicks butt otherwise!
Stefan Pohl (migrated from JIRA)
Thanks, Mike, this behaves as expected. Now we have a sense of what trade-off we'd be going for if we agree on the current model, it is still a hard decision though, entailing questions like:
A few suggestions for a better model which maybe goes beyond the scope of this ticket:
A very conservative usage rule for MSMSumScorer would be to use it only if the constraint is at least one higher than the number of high-freq terms, then it will always "kick butt" and we'd get most bang of this scorer without having slow-downs. But we'd miss out on many cases where it would be faster and those might be the ones that are used in practice by users, and it is not clear (to me:-) what 'high-freq' means. If at all, this should be seen relative to the highest-freq subclause.
More generally, it seems to me the problem we're trying to solve here is identical to computing a cost. If the cost returned by Scorers correlates with execution time, then we could simply call the cost() method on BS and MSMSumScorer and use MSMSumScorer if it is significantly below the former (assuming there are no side-effects in doing these calls). So we'd defer the problem to the individual Scorers, which splits the problem up into smaller subproblems and the Scorers know themselves best about their implementation and behavior.
To make accurate decisions, we probably have to extend the cost-API to return more detailed information to base decision rules on, e.g. upper bound, lower bound (to be able to make conservative/speculative decisions) and estimate the number of returned docs and runtime-correlated cost (in some unit). For instance, MSMSumScorer's overall cost depends on both of the latter and can be split up into the following 2 stages:
The costs would still have to be weighted by the relative overhead of 1) heap-merging, 2) advance() + early-stopping; not sure, if constants are enough here.
While the scope of this topic seems large (modelling all scorers), I currently don't see a simpler way to make this reliably work for arbitrarily structured queries, think of MSM(Disj(MSM(Conj(...),...),...),subtree2,...).
Michael McCandless (@mikemccand) (migrated from JIRA)
I really don't really know what the typical/common use cases are for minShouldMatch.
I agree we should err towards BS2, since it can be insanely faster while BS1 can only be \~3X faster (on super-slow queries to begin with), in this test anyway.
A more accurate cost model for scorers would be awesome! This could be a general framework that we'd be able to use for various forms for query optimizing (which we don't do today or do with heuristics), eg things like whether to apply a filter (AND) high vs low, whether to use BS1 or BS2 for pure conjunctions, when to split a PhraseQuery into conjunction + position checking, flattening of nested boolean queries, MultiTermQuery rewrite method, etc. But probably we should explore this on a new issue.
Stefan Pohl (migrated from JIRA)
I really don't really know what the typical/common use cases are for minShouldMatch.
What about your own great work (http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html) as a use-case to start with? Maybe some consulting committers can also share some insight on how this is used in the wild.
Michael McCandless (@mikemccand) (migrated from JIRA)
What about your own great work (http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html) as a use-case to start with?
Thank Stefan :) That's sort of a specialized (but very useful) use case I think ... and the minShouldMatch is always N-1.
Maybe some consulting committers can also share some insight on how this is used in the wild.
+1, that'd be great to know!
Simon Willnauer (@s1monw) (migrated from JIRA)
I often use min_should_match in practice. Like one example is if you do search for titles or name like POI's or meta-data. Lets take youtube as an example you often get queries like "queen wembley live 1989" which was in-fact 1986 (at least the one I meant here) a pretty good pattern is to use some metric like 80% must match if >= 2 query terms etc. Another good example is if you use shingles a query like "queen wembley live 1989" produces lots of terms and "wembley live" might be pretty common so you want to make sure that you are not returning stuff from other band but on the other hand a pure conjunction is not acceptable here either.
hope that give some insight?
Eks Dev (migrated from JIRA)
the same pattern like Simon here, just having these terms wrapped in fuzzy/prefix query, often as dismax query.
for example: BQ(boo* OR hoo* OR whatever) with e.g. minShouldMatch = 2
So the only diff to Simon's case is that single boolean clauses are often more complicated then simple TermQuery
Robert Muir (@rmuir) (migrated from JIRA)
I really don't really know what the typical/common use cases are for minShouldMatch.
One very practical thing is that solr queryparsers (probably elasticsearch has similar ones too?) such as dismax/edismax actually seem to be fully defined in terms of minShouldMatch (with the extremes being handled as OR and AND).
I know Tom Burton-West has experimented with this some on chinese TREC data (he has some comments on SOLR-3589), etc.
Steven Rowe (@sarowe) (migrated from JIRA)
Bulk move 4.4 issues to 4.5 and 5.0
Uwe Schindler (@uschindler) (migrated from JIRA)
Move issue to Lucene 4.9.
5637 adds a dedicated document-at-time scorer for minNrShouldMatch which can use advance() behind the scenes.
In cases where you have some really common terms and some rare ones this can be a huge performance improvement.
On the other hand BooleanScorer might still be faster in some cases.
We should think about what the logic should be here: one simple thing to do is to always use the new scorer when minShouldMatch is set: thats where i'm leaning.
But maybe we could have a smarter heuristic too, perhaps based on cost()
Migrated from LUCENE-4872 by Robert Muir (@rmuir), 1 vote, updated May 09 2016 Parent: #5637 Attachments: crazyMinShouldMatch.tasks