Closed: asfimport closed this issue 9 months ago
Atri Sharma (@atris) (migrated from JIRA)
+1. I was looking at comparisons with Vespa recently and noticed something similar.
Are you planning to work on this? Can take this up if you would like.
Adrien Grand (@jpountz) (migrated from JIRA)
No I'm not, feel free to take it! For what it's worth, Block-Max MaxScore is often reported as being a bit faster despite the fact that it often evaluates more documents than Block-Max WAND, since it makes fewer per-document decisions, so it might be interesting to explore implementing it for this bulk scorer.
Zach Chen (@zacharymorn) (migrated from JIRA)
Just curious, I see code in https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L261-L288 and BooleanScorer (extends BulkScorer) are already available for optimized disjunction scoring. Is the goal here to leverage some ideas from Block-Max MaxScore to reduce the per-document decision in the code path above (particularly BooleanScorer) ?
Adrien Grand (@jpountz) (migrated from JIRA)
@zacharymorn Yes, that would be one idea. In the BMM paper (http://engineering.nyu.edu/~suel/papers/bmm.pdf) BMM is usually a bit slower than BMW, but not always. I'd be curious to know whether we observe the same result in Lucene.
Since we introduced BMW there have been a few reports that top-level disjunctions got slower. This is usually because there are many clauses in a disjunction that have about the same max score and BMW can hardly skip evaluating documents. In such cases we pay for the BMW overhead without enjoying any benefits. Because BMM has less overhead, I would expect it to perform better in these worst-case scenarios, so I wonder if we should look into using BMM for top-level disjunctions in general.
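The MaxScore idea behind this discussion can be sketched as follows. This is a simplified, hypothetical illustration of the essential / non-essential partitioning (class and method names are made up, not Lucene's actual API): scorers are sorted by max score, and the longest prefix whose cumulative max score cannot reach the minimum competitive score is "non-essential", so only the remaining clauses need to drive iteration.

```java
import java.util.Arrays;

// Simplified sketch of MaxScore's clause partitioning (hypothetical, not Lucene code).
// A document matching only non-essential clauses can never enter the top-k,
// because even their combined max scores stay at or below the current threshold.
public class MaxScorePartition {
  /**
   * Returns the index of the first "essential" clause after sorting clauses by
   * ascending max score. Clauses before that index are non-essential.
   */
  public static int firstEssentialIndex(double[] maxScores, double minCompetitiveScore) {
    double[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double cumulative = 0;
    for (int i = 0; i < sorted.length; i++) {
      cumulative += sorted[i];
      if (cumulative > minCompetitiveScore) {
        return i; // sorted[0..i-1] are non-essential, sorted[i..] are essential
      }
    }
    return sorted.length; // no clause is essential: no new doc can be competitive
  }

  public static void main(String[] args) {
    // Three clauses with max scores 1.0, 2.0 and 4.0, threshold 2.5:
    // 1.0 alone cannot reach 2.5, but 1.0 + 2.0 = 3.0 can, so only the
    // lowest-scoring clause is non-essential.
    System.out.println(firstEssentialIndex(new double[] {4.0, 1.0, 2.0}, 2.5));
  }
}
```

When many clauses have similar max scores, this prefix stays short and MaxScore (like WAND) can skip little, which is exactly the worst-case scenario described above.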
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, thanks for the additional paper! I gave it a read and found it to be a great summary of and refresher on the BMM & BMW algorithms, in addition to some other interesting ones! I also saw that the paper mentioned BMM was found to be faster than BMW when there are 5+ clauses in the disjunction, which echoes your observation above as well.
I explored this a bit in the code, and found one heuristic-based approach (using iterator cost similarity as a proxy for block max score similarity across scorers) that seems to give interesting results:
Code changes:
diff --git a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
index bdf085d4669..f71e2c2fdd8 100644
--- a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
+++ b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
@@ -238,11 +238,33 @@ final class Boolean2ScorerSupplier extends ScorerSupplier {
//
// However, as WANDScorer uses more complex algorithm and data structure, we would like to
// still use DisjunctionSumScorer to handle exhaustive pure disjunctions, which may be faster
- if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
+ boolean isPureDisjunction =
+ subs.get(Occur.FILTER).isEmpty()
+ && subs.get(Occur.MUST).isEmpty()
+ && subs.get(Occur.MUST_NOT).isEmpty();
+ // top-level boolean term query
+ boolean allTermScorers =
+ optionalScorers.stream().allMatch(scorer -> scorer instanceof TermScorer);
+
+ if (isPureDisjunction && allTermScorers && isSimilarCost(optionalScorers) && minShouldMatch <= 1) {
+ return new DisjunctionMaxScorer(weight, 0.01f, optionalScorers, scoreMode);
+ } else if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
return new WANDScorer(weight, optionalScorers, minShouldMatch, scoreMode);
} else {
return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
}
}
}
+
+ private boolean isSimilarCost(List<Scorer> optionalScorers) {
+ long minCost = Long.MAX_VALUE;
+ long maxCost = Long.MIN_VALUE;
+ for (Scorer scorer : optionalScorers) {
+ long cost = scorer.iterator().cost();
+ minCost = Math.min(minCost, cost);
+ maxCost = Math.max(maxCost, cost);
+ }
+
+ return maxCost / minCost < 2;
+ }
}
From this change, I have the following luceneutil benchmark result with source wikimedium5m, with searchBench.py also modified to not fail when hit counts differ between baseline and candidate:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
HighTermDayOfYearSort 264.97 (16.9%) 248.25 (9.4%) -6.3% ( -27% - 23%) 0.143
OrHighLow 477.81 (5.7%) 456.76 (6.3%) -4.4% ( -15% - 8%) 0.021
Fuzzy1 82.43 (12.9%) 78.99 (11.0%) -4.2% ( -24% - 22%) 0.271
HighTermTitleBDVSort 237.06 (10.7%) 227.86 (8.9%) -3.9% ( -21% - 17%) 0.214
OrNotHighMed 646.77 (4.1%) 631.45 (4.7%) -2.4% ( -10% - 6%) 0.091
HighTerm 1717.41 (5.4%) 1681.25 (5.4%) -2.1% ( -12% - 9%) 0.219
PKLookup 213.65 (4.0%) 210.55 (3.9%) -1.4% ( -9% - 6%) 0.247
MedTerm 1703.74 (4.0%) 1680.75 (7.7%) -1.3% ( -12% - 10%) 0.487
OrHighNotHigh 679.24 (7.0%) 670.77 (7.0%) -1.2% ( -14% - 13%) 0.572
HighTermMonthSort 111.16 (13.4%) 109.81 (13.5%) -1.2% ( -24% - 29%) 0.775
MedSpanNear 144.50 (2.0%) 142.97 (3.8%) -1.1% ( -6% - 4%) 0.268
OrHighNotLow 738.44 (6.6%) 730.70 (5.5%) -1.0% ( -12% - 11%) 0.586
LowPhrase 255.32 (1.7%) 252.66 (3.0%) -1.0% ( -5% - 3%) 0.169
TermDTSort 255.93 (15.2%) 253.43 (13.3%) -1.0% ( -25% - 32%) 0.829
AndHighMed 404.90 (2.4%) 401.82 (3.5%) -0.8% ( -6% - 5%) 0.423
BrowseMonthTaxoFacets 12.90 (2.9%) 12.80 (3.4%) -0.7% ( -6% - 5%) 0.469
LowSpanNear 85.22 (1.5%) 84.71 (2.0%) -0.6% ( -4% - 2%) 0.285
BrowseDateTaxoFacets 10.72 (2.7%) 10.66 (2.8%) -0.6% ( -5% - 5%) 0.523
HighSpanNear 15.64 (1.9%) 15.55 (2.4%) -0.5% ( -4% - 3%) 0.423
Wildcard 135.32 (2.8%) 134.61 (4.0%) -0.5% ( -7% - 6%) 0.631
MedPhrase 86.95 (1.6%) 86.51 (2.7%) -0.5% ( -4% - 3%) 0.470
AndHighLow 772.76 (3.4%) 769.05 (3.9%) -0.5% ( -7% - 7%) 0.680
LowTerm 1572.57 (4.3%) 1565.62 (4.7%) -0.4% ( -9% - 9%) 0.758
BrowseDayOfYearTaxoFacets 10.74 (2.8%) 10.71 (2.9%) -0.3% ( -5% - 5%) 0.713
OrNotHighLow 1212.21 (3.4%) 1208.45 (4.1%) -0.3% ( -7% - 7%) 0.797
Fuzzy2 52.79 (15.7%) 52.64 (11.6%) -0.3% ( -23% - 32%) 0.949
HighIntervalsOrdered 44.22 (1.3%) 44.12 (1.3%) -0.2% ( -2% - 2%) 0.571
HighPhrase 385.28 (2.7%) 384.37 (3.4%) -0.2% ( -6% - 6%) 0.809
Respell 94.82 (2.7%) 94.64 (3.3%) -0.2% ( -5% - 5%) 0.839
OrNotHighHigh 676.82 (5.0%) 675.66 (6.2%) -0.2% ( -10% - 11%) 0.923
MedSloppyPhrase 27.08 (2.9%) 27.07 (3.3%) -0.0% ( -6% - 6%) 0.980
AndHighHigh 44.82 (3.8%) 44.90 (4.7%) 0.2% ( -8% - 9%) 0.888
BrowseDayOfYearSSDVFacets 27.27 (2.8%) 27.33 (1.9%) 0.2% ( -4% - 4%) 0.757
HighSloppyPhrase 87.76 (2.5%) 88.26 (3.4%) 0.6% ( -5% - 6%) 0.542
Prefix3 123.34 (4.1%) 124.08 (3.7%) 0.6% ( -6% - 8%) 0.628
OrHighNotMed 746.91 (5.7%) 752.11 (7.0%) 0.7% ( -11% - 14%) 0.730
LowSloppyPhrase 329.41 (3.2%) 332.29 (2.3%) 0.9% ( -4% - 6%) 0.324
IntNRQ 304.68 (2.6%) 307.51 (2.1%) 0.9% ( -3% - 5%) 0.219
BrowseMonthSSDVFacets 30.50 (5.9%) 30.87 (2.6%) 1.2% ( -6% - 10%) 0.404
OrHighMed 128.85 (2.5%) 134.50 (8.8%) 4.4% ( -6% - 16%) 0.033
OrHighHigh 57.01 (2.5%) 237.95 (22.2%) 317.4% ( 285% - 351%) 0.000
WARNING: cat=OrHighHigh: hit counts differ: 15200+ vs 28513+
WARNING: cat=OrHighMed: hit counts differ: 3700+ vs 6567+
errors occurred: ([], ['query=body:de body:use filter=None sort=None groupField=None hitCount=15200+: wrong hitCount: 15200+ vs 28513+', 'query=body:27 body:throughout filter=None sort=None groupField=None hitCount=3700+: wrong hitCount: 3700+ vs 6567+'], 1.0)
I have a couple of observations / questions from the results:
Zach Chen (@zacharymorn) (migrated from JIRA)
@atris, are you currently working on this? Right now I'm just exploring this issue out of curiosity, and hopefully my exploration doesn't impact your ongoing work.
Michael McCandless (@mikemccand) (migrated from JIRA)
Not sure if this benchmark result is valid, given the hit count differs?
I think (not certain!) luceneutil verifies the top N hits are the same between candidate and baseline?
Since you are changing the BMW/BMM opto, I think it is expected that the (estimated) total hit count would change? In which case, it's fine to ignore that difference.
Zach Chen (@zacharymorn) (migrated from JIRA)
Thanks Michael for the comment! I went ahead and searched for the error message "wrong hitCount" from above in luceneutil, and found these
So I'm guessing that failure actually happened before the verification of top N hits as well as the id / score checks, and thus may mask further potential failures and abort the task early (which also explains why I got varying benchmark results across multiple recent runs)?
Adrien Grand (@jpountz) (migrated from JIRA)
DisjunctionMaxScorer
It should be DisjunctionSumScorer if we want to get the same scoring (DisjunctionSumScorer sums up scores of the clauses while DisjunctionMaxScorer takes the max).
Can you run the benchmark with verifyCounts=False like we do for nightlies so that it would only check top hits? https://github.com/mikemccand/luceneutil/blob/fbc9ae15a1bb47f7e15c95fb70f1bda57faccfc1/src/python/nightlyBench.py#L822
Atri Sharma (@atris) (migrated from JIRA)
@zacharymorn I will not be able to take this on for a while. Please feel free to take this on.
Zach Chen (@zacharymorn) (migrated from JIRA)
Thanks Atri for the confirmation!
Zach Chen (@zacharymorn) (migrated from JIRA)
DisjunctionMaxScorer
It should be DisjunctionSumScorer if we want to get the same scoring (DisjunctionSumScorer sums up scores of the clauses while DisjunctionMaxScorer takes the max).
Can you run the benchmark with verifyCounts=False like we do for nightlies so that it would only check top hits? https://github.com/mikemccand/luceneutil/blob/fbc9ae15a1bb47f7e15c95fb70f1bda57faccfc1/src/python/nightlyBench.py#L822
Ah, I was wondering about the nightly benchmark before, and I see now why Michael suggested the top N hit verification above. Let me give that a try. I didn't use DisjunctionSumScorer earlier as it seems to be missing some block-max-related method implementations, but I will use it to run the latest benchmark as well.
Zach Chen (@zacharymorn) (migrated from JIRA)
I made the following changes, and actually still saw varying benchmark results across runs (randomized queries?). I've listed them below:
Changes in Boolean2ScorerSupplier.java (use DisjunctionSumScorer instead of DisjunctionMaxScorer)
diff --git a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
index bdf085d4669..10478ab45bf 100644
--- a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
+++ b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
@@ -238,11 +238,34 @@ final class Boolean2ScorerSupplier extends ScorerSupplier {
//
// However, as WANDScorer uses more complex algorithm and data structure, we would like to
// still use DisjunctionSumScorer to handle exhaustive pure disjunctions, which may be faster
- if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
+ boolean isPureDisjunction =
+ subs.get(Occur.FILTER).isEmpty()
+ && subs.get(Occur.MUST).isEmpty()
+ && subs.get(Occur.MUST_NOT).isEmpty();
+ // top-level boolean term query
+ boolean allTermScorers =
+ optionalScorers.stream().allMatch(scorer -> scorer instanceof TermScorer);
+
+ if (isPureDisjunction && allTermScorers && isSimilarCost(optionalScorers) && minShouldMatch <= 1) {
+ return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
+ } else if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
return new WANDScorer(weight, optionalScorers, minShouldMatch, scoreMode);
} else {
return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
}
}
}
+
+ private boolean isSimilarCost(List<Scorer> optionalScorers) {
+ long minCost = Long.MAX_VALUE;
+ long maxCost = Long.MIN_VALUE;
+ for (Scorer scorer : optionalScorers) {
+ long cost = scorer.iterator().cost();
+ minCost = Math.min(minCost, cost);
+ maxCost = Math.max(maxCost, cost);
+ }
+
+ // TODO heuristic based cost-similarity threshold
+ return maxCost / minCost < 2;
+ }
}
Changes in benchUtil.py to not verify counts
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..3579f45 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
cmpRawResults, heapCmp = parseResults(cmpLogFiles)
# make sure they got identical results
- cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+ cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, False)
baseResults = collateResults(baseRawResults)
cmpResults = collateResults(cmpRawResults)
Benchmark result 1 with source wikimedium5m:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHigh 87.27 (4.9%) 51.12 (1.7%) -41.4% ( -45% - -36%) 0.000
OrHighLow 624.55 (7.6%) 589.16 (8.1%) -5.7% ( -19% - 10%) 0.022
OrHighMed 135.02 (3.7%) 129.51 (6.6%) -4.1% ( -13% - 6%) 0.016
Wildcard 214.30 (3.3%) 209.33 (2.8%) -2.3% ( -8% - 3%) 0.017
OrNotHighHigh 728.60 (8.5%) 713.53 (6.3%) -2.1% ( -15% - 13%) 0.383
HighTerm 1195.98 (6.0%) 1174.51 (4.2%) -1.8% ( -11% - 8%) 0.273
LowTerm 1757.60 (6.0%) 1728.64 (4.8%) -1.6% ( -11% - 9%) 0.336
AndHighMed 231.78 (4.0%) 227.96 (3.7%) -1.6% ( -8% - 6%) 0.175
Prefix3 196.03 (3.4%) 193.19 (3.5%) -1.4% ( -8% - 5%) 0.180
Respell 59.52 (2.8%) 59.05 (2.7%) -0.8% ( -6% - 4%) 0.362
MedTerm 1507.60 (5.3%) 1495.89 (3.3%) -0.8% ( -8% - 8%) 0.580
BrowseDateTaxoFacets 11.04 (3.3%) 10.97 (2.8%) -0.7% ( -6% - 5%) 0.462
BrowseMonthTaxoFacets 13.21 (3.4%) 13.12 (3.9%) -0.7% ( -7% - 6%) 0.542
MedSloppyPhrase 67.05 (3.4%) 66.58 (4.0%) -0.7% ( -7% - 6%) 0.544
IntNRQ 215.89 (4.2%) 214.39 (2.9%) -0.7% ( -7% - 6%) 0.543
BrowseDayOfYearTaxoFacets 11.02 (3.2%) 10.96 (2.7%) -0.6% ( -6% - 5%) 0.546
LowPhrase 193.14 (4.0%) 192.05 (4.5%) -0.6% ( -8% - 8%) 0.678
BrowseDayOfYearSSDVFacets 27.80 (5.2%) 27.67 (5.5%) -0.5% ( -10% - 10%) 0.781
OrHighNotLow 823.92 (6.1%) 820.15 (4.8%) -0.5% ( -10% - 11%) 0.790
PKLookup 215.92 (3.9%) 215.02 (3.8%) -0.4% ( -7% - 7%) 0.734
Fuzzy1 65.82 (7.8%) 65.58 (11.0%) -0.4% ( -17% - 20%) 0.904
HighSloppyPhrase 42.05 (3.9%) 41.91 (3.3%) -0.3% ( -7% - 7%) 0.771
MedSpanNear 155.25 (3.5%) 154.78 (3.3%) -0.3% ( -6% - 6%) 0.779
AndHighHigh 84.97 (4.4%) 84.79 (3.0%) -0.2% ( -7% - 7%) 0.857
LowSloppyPhrase 100.76 (3.6%) 100.55 (3.6%) -0.2% ( -7% - 7%) 0.857
HighIntervalsOrdered 42.39 (3.4%) 42.34 (3.7%) -0.1% ( -6% - 7%) 0.921
HighTermDayOfYearSort 210.65 (15.0%) 210.79 (11.5%) 0.1% ( -23% - 31%) 0.987
HighPhrase 468.21 (4.5%) 468.66 (4.0%) 0.1% ( -7% - 8%) 0.943
HighSpanNear 148.68 (3.6%) 148.94 (3.6%) 0.2% ( -6% - 7%) 0.880
OrNotHighMed 682.83 (7.2%) 684.51 (4.4%) 0.2% ( -10% - 12%) 0.896
OrHighNotMed 733.07 (7.3%) 736.52 (4.9%) 0.5% ( -10% - 13%) 0.811
OrHighNotHigh 638.40 (7.2%) 642.31 (4.5%) 0.6% ( -10% - 13%) 0.747
BrowseMonthSSDVFacets 31.42 (5.2%) 31.65 (2.6%) 0.7% ( -6% - 8%) 0.577
AndHighLow 923.45 (5.9%) 933.64 (5.3%) 1.1% ( -9% - 13%) 0.534
MedPhrase 347.33 (5.7%) 351.57 (3.3%) 1.2% ( -7% - 10%) 0.404
LowSpanNear 311.32 (6.2%) 315.13 (4.2%) 1.2% ( -8% - 12%) 0.466
Fuzzy2 59.05 (12.3%) 60.19 (10.6%) 1.9% ( -18% - 28%) 0.594
OrNotHighLow 851.87 (6.4%) 869.95 (5.6%) 2.1% ( -9% - 15%) 0.263
HighTermMonthSort 99.63 (12.7%) 102.07 (15.1%) 2.5% ( -22% - 34%) 0.578
HighTermTitleBDVSort 161.36 (16.9%) 165.53 (18.3%) 2.6% ( -27% - 45%) 0.643
TermDTSort 195.70 (11.5%) 201.16 (10.4%) 2.8% ( -17% - 27%) 0.420
WARNING: cat=OrHighHigh: hit counts differ: 13070+ vs 357939+
Benchmark result 2 with source wikimedium5m:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 69.00 (13.7%) 64.57 (14.5%) -6.4% ( -30% - 25%) 0.150
Fuzzy2 50.00 (15.2%) 47.36 (17.9%) -5.3% ( -33% - 32%) 0.313
OrHighNotLow 780.22 (5.3%) 764.22 (3.9%) -2.1% ( -10% - 7%) 0.165
OrHighLow 235.87 (3.5%) 232.57 (3.2%) -1.4% ( -7% - 5%) 0.187
OrHighNotMed 741.20 (4.0%) 733.20 (4.9%) -1.1% ( -9% - 8%) 0.450
OrNotHighHigh 850.77 (5.9%) 842.95 (5.3%) -0.9% ( -11% - 10%) 0.606
IntNRQ 159.89 (1.5%) 158.54 (3.1%) -0.8% ( -5% - 3%) 0.270
OrHighMed 161.49 (5.0%) 160.24 (6.0%) -0.8% ( -11% - 10%) 0.659
MedTerm 1534.83 (4.9%) 1524.53 (5.7%) -0.7% ( -10% - 10%) 0.691
LowTerm 1854.25 (6.1%) 1842.02 (5.7%) -0.7% ( -11% - 11%) 0.725
Respell 67.40 (1.8%) 66.96 (2.5%) -0.7% ( -4% - 3%) 0.346
Wildcard 222.44 (2.5%) 221.67 (2.3%) -0.3% ( -5% - 4%) 0.645
Prefix3 213.45 (3.6%) 212.73 (3.9%) -0.3% ( -7% - 7%) 0.776
OrHighHigh 62.29 (2.8%) 62.08 (2.5%) -0.3% ( -5% - 5%) 0.700
BrowseDayOfYearSSDVFacets 27.76 (7.0%) 27.69 (6.9%) -0.3% ( -13% - 14%) 0.897
HighSpanNear 108.87 (2.3%) 108.63 (3.8%) -0.2% ( -6% - 6%) 0.820
BrowseMonthTaxoFacets 13.29 (2.3%) 13.30 (1.9%) 0.0% ( -4% - 4%) 0.940
LowSloppyPhrase 171.64 (2.8%) 171.88 (2.9%) 0.1% ( -5% - 6%) 0.875
PKLookup 217.98 (3.9%) 218.52 (3.6%) 0.2% ( -7% - 8%) 0.836
HighTerm 1392.99 (4.5%) 1396.79 (4.5%) 0.3% ( -8% - 9%) 0.847
HighIntervalsOrdered 29.54 (2.7%) 29.63 (2.6%) 0.3% ( -4% - 5%) 0.732
AndHighLow 797.99 (4.0%) 800.55 (3.7%) 0.3% ( -7% - 8%) 0.792
OrNotHighLow 885.90 (4.3%) 888.91 (5.4%) 0.3% ( -8% - 10%) 0.826
MedSloppyPhrase 65.81 (2.9%) 66.06 (3.1%) 0.4% ( -5% - 6%) 0.689
LowSpanNear 59.31 (2.6%) 59.55 (2.5%) 0.4% ( -4% - 5%) 0.609
BrowseDateTaxoFacets 11.19 (2.8%) 11.25 (2.8%) 0.5% ( -5% - 6%) 0.542
LowPhrase 170.61 (2.6%) 171.55 (2.6%) 0.6% ( -4% - 5%) 0.502
MedSpanNear 192.13 (3.2%) 193.32 (2.2%) 0.6% ( -4% - 6%) 0.469
BrowseDayOfYearTaxoFacets 11.20 (2.9%) 11.28 (2.9%) 0.7% ( -4% - 6%) 0.460
HighSloppyPhrase 82.88 (5.4%) 83.47 (5.5%) 0.7% ( -9% - 12%) 0.681
BrowseMonthSSDVFacets 31.59 (4.7%) 31.91 (2.0%) 1.0% ( -5% - 8%) 0.387
MedPhrase 138.53 (2.1%) 140.00 (2.7%) 1.1% ( -3% - 5%) 0.164
HighPhrase 294.46 (2.7%) 297.99 (2.3%) 1.2% ( -3% - 6%) 0.135
OrHighNotHigh 654.84 (5.4%) 663.25 (4.8%) 1.3% ( -8% - 12%) 0.427
HighTermMonthSort 175.68 (10.6%) 178.02 (11.3%) 1.3% ( -18% - 25%) 0.700
HighTermDayOfYearSort 301.64 (16.6%) 306.26 (17.4%) 1.5% ( -27% - 42%) 0.776
OrNotHighMed 694.25 (6.0%) 705.04 (5.6%) 1.6% ( -9% - 13%) 0.396
AndHighHigh 105.76 (2.4%) 107.44 (3.4%) 1.6% ( -4% - 7%) 0.087
TermDTSort 227.39 (12.5%) 232.52 (11.9%) 2.3% ( -19% - 30%) 0.559
AndHighMed 241.36 (2.9%) 246.82 (3.5%) 2.3% ( -4% - 8%) 0.026
HighTermTitleBDVSort 345.18 (13.3%) 361.51 (14.7%) 4.7% ( -20% - 37%) 0.286
Zach Chen (@zacharymorn) (migrated from JIRA)
I made some further changes to move some block max related logic from DisjunctionMaxScorer to DisjunctionScorer, so that DisjunctionSumScorer can inherit. I've published a WIP PR https://github.com/apache/lucene/pull/81 for those changes for the ease of review.
When I run luceneutil, I see further errors from the verifyScores section of the code, which may indicate bugs in my changes:
WARNING: cat=OrHighHigh: hit counts differ: 9870+ vs 2616+
Traceback (most recent call last):
File "src/python/localrun.py", line 53, in <module>
comp.benchmark("baseline_vs_patch")
File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/competition.py", line 455, in benchmark
randomSeed = self.randomSeed)
File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/searchBench.py", line 196, in run
raise RuntimeError('errors occurred: %s' % str(cmpDiffs))
RuntimeError: errors occurred: ([], ["query=body:second body:short filter=None sort=None groupField=None hitCount=9870+: hit 0 has wrong field/score value ([1444649], '5.0718417') vs ([5125], '4.224689')"], 1.0)
I then made further changes in benchUtil.py to skip over verifyScores, so that I can see what benchmark results it would generate:
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..c2faffc 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
cmpRawResults, heapCmp = parseResults(cmpLogFiles)
# make sure they got identical results
- cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+ cmpDiffs = compareHits(baseRawResults, cmpRawResults, False, False)
baseResults = collateResults(baseRawResults)
cmpResults = collateResults(cmpRawResults)
I then got the following benchmark results from multiple runs:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighMed 186.44 (2.8%) 160.50 (4.5%) -13.9% ( -20% - -6%) 0.000
OrHighLow 735.70 (7.5%) 696.89 (4.3%) -5.3% ( -15% - 6%) 0.006
Fuzzy1 75.85 (11.5%) 72.81 (14.0%) -4.0% ( -26% - 24%) 0.323
TermDTSort 237.49 (10.4%) 228.02 (10.6%) -4.0% ( -22% - 18%) 0.230
HighTermMonthSort 280.82 (9.8%) 274.90 (10.8%) -2.1% ( -20% - 20%) 0.518
Fuzzy2 54.08 (12.5%) 53.04 (14.2%) -1.9% ( -25% - 28%) 0.648
OrNotHighMed 672.83 (2.7%) 661.16 (4.7%) -1.7% ( -8% - 5%) 0.153
HighTermTitleBDVSort 438.56 (14.4%) 431.81 (16.6%) -1.5% ( -28% - 34%) 0.754
AndHighLow 969.43 (5.2%) 957.49 (4.7%) -1.2% ( -10% - 9%) 0.432
OrNotHighHigh 704.98 (3.4%) 700.72 (3.9%) -0.6% ( -7% - 7%) 0.605
AndHighHigh 109.77 (4.2%) 109.31 (4.7%) -0.4% ( -9% - 8%) 0.767
BrowseMonthSSDVFacets 32.52 (2.1%) 32.40 (4.6%) -0.4% ( -6% - 6%) 0.755
PKLookup 219.90 (3.1%) 219.16 (3.2%) -0.3% ( -6% - 6%) 0.734
Wildcard 284.84 (1.9%) 284.18 (1.8%) -0.2% ( -3% - 3%) 0.690
Prefix3 361.00 (2.1%) 360.24 (2.0%) -0.2% ( -4% - 4%) 0.750
HighIntervalsOrdered 28.68 (2.2%) 28.64 (1.7%) -0.1% ( -3% - 3%) 0.819
BrowseMonthTaxoFacets 13.60 (2.9%) 13.59 (2.7%) -0.1% ( -5% - 5%) 0.947
BrowseDayOfYearSSDVFacets 28.67 (4.8%) 28.66 (4.8%) -0.0% ( -9% - 10%) 0.979
HighSpanNear 79.29 (2.4%) 79.29 (2.2%) 0.0% ( -4% - 4%) 0.997
OrHighNotHigh 695.37 (5.5%) 696.65 (3.8%) 0.2% ( -8% - 10%) 0.903
MedTerm 1478.47 (3.6%) 1481.54 (3.0%) 0.2% ( -6% - 7%) 0.843
HighTermDayOfYearSort 372.12 (14.1%) 373.08 (14.8%) 0.3% ( -25% - 33%) 0.955
IntNRQ 125.36 (1.3%) 125.72 (0.7%) 0.3% ( -1% - 2%) 0.391
LowSpanNear 52.82 (1.7%) 52.98 (2.0%) 0.3% ( -3% - 4%) 0.611
BrowseDayOfYearTaxoFacets 11.28 (3.1%) 11.31 (3.1%) 0.3% ( -5% - 6%) 0.756
LowSloppyPhrase 154.42 (2.9%) 154.91 (2.9%) 0.3% ( -5% - 6%) 0.731
MedPhrase 143.27 (2.9%) 143.88 (2.5%) 0.4% ( -4% - 6%) 0.625
OrHighNotMed 760.65 (6.8%) 763.93 (5.4%) 0.4% ( -10% - 13%) 0.824
Respell 86.71 (1.5%) 87.11 (2.1%) 0.5% ( -3% - 4%) 0.425
MedSpanNear 210.43 (2.2%) 211.43 (1.4%) 0.5% ( -3% - 4%) 0.414
MedSloppyPhrase 220.29 (2.6%) 221.35 (2.2%) 0.5% ( -4% - 5%) 0.528
BrowseDateTaxoFacets 11.24 (3.0%) 11.30 (3.1%) 0.6% ( -5% - 6%) 0.529
LowPhrase 174.98 (2.4%) 176.05 (2.0%) 0.6% ( -3% - 5%) 0.385
HighSloppyPhrase 100.25 (3.2%) 100.88 (3.0%) 0.6% ( -5% - 7%) 0.524
OrHighNotLow 1016.27 (7.7%) 1025.44 (6.9%) 0.9% ( -12% - 16%) 0.696
LowTerm 1634.20 (3.8%) 1649.20 (2.6%) 0.9% ( -5% - 7%) 0.376
HighPhrase 415.55 (3.0%) 419.76 (2.9%) 1.0% ( -4% - 7%) 0.282
OrNotHighLow 940.18 (5.4%) 952.61 (2.7%) 1.3% ( -6% - 9%) 0.328
HighTerm 1163.01 (3.8%) 1178.47 (4.8%) 1.3% ( -6% - 10%) 0.329
AndHighMed 365.15 (4.4%) 370.53 (3.1%) 1.5% ( -5% - 9%) 0.225
OrHighHigh 80.20 (2.2%) 718.16 (158.9%) 795.4% ( 620% - 978%) 0.000
WARNING: cat=OrHighHigh: hit counts differ: 19022+ vs 1002+
WARNING: cat=OrHighMed: hit counts differ: 4321+ vs 4289+
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighLow 667.55 (4.3%) 649.15 (5.9%) -2.8% ( -12% - 7%) 0.092
OrHighNotHigh 866.11 (5.2%) 851.54 (5.2%) -1.7% ( -11% - 9%) 0.305
HighTermDayOfYearSort 293.23 (16.4%) 288.53 (15.0%) -1.6% ( -28% - 35%) 0.747
Fuzzy1 58.36 (10.3%) 57.55 (10.2%) -1.4% ( -19% - 21%) 0.666
OrHighNotLow 709.78 (3.0%) 702.08 (3.9%) -1.1% ( -7% - 5%) 0.322
LowTerm 1816.04 (4.4%) 1797.54 (4.8%) -1.0% ( -9% - 8%) 0.488
Fuzzy2 50.30 (11.0%) 49.85 (11.7%) -0.9% ( -21% - 24%) 0.802
OrHighHigh 55.81 (2.4%) 55.35 (2.4%) -0.8% ( -5% - 4%) 0.288
HighSpanNear 15.16 (2.9%) 15.07 (3.0%) -0.6% ( -6% - 5%) 0.547
MedSpanNear 67.82 (3.1%) 67.47 (3.5%) -0.5% ( -6% - 6%) 0.613
OrHighMed 195.58 (2.7%) 194.60 (2.6%) -0.5% ( -5% - 4%) 0.548
LowSpanNear 36.88 (2.7%) 36.75 (2.9%) -0.3% ( -5% - 5%) 0.690
BrowseMonthTaxoFacets 13.05 (3.3%) 13.01 (3.5%) -0.3% ( -6% - 6%) 0.749
HighIntervalsOrdered 44.33 (1.3%) 44.18 (1.4%) -0.3% ( -3% - 2%) 0.439
HighSloppyPhrase 17.92 (4.4%) 17.87 (4.3%) -0.3% ( -8% - 8%) 0.821
MedPhrase 78.25 (2.5%) 78.12 (2.1%) -0.2% ( -4% - 4%) 0.823
BrowseDayOfYearSSDVFacets 27.73 (2.4%) 27.68 (2.0%) -0.2% ( -4% - 4%) 0.817
PKLookup 213.40 (2.9%) 213.09 (2.4%) -0.1% ( -5% - 5%) 0.862
AndHighHigh 100.38 (2.9%) 100.25 (2.9%) -0.1% ( -5% - 5%) 0.891
BrowseDayOfYearTaxoFacets 10.79 (3.4%) 10.78 (3.7%) -0.1% ( -6% - 7%) 0.912
AndHighLow 778.66 (3.4%) 778.10 (3.1%) -0.1% ( -6% - 6%) 0.945
Wildcard 141.04 (2.1%) 141.00 (2.4%) -0.0% ( -4% - 4%) 0.970
BrowseDateTaxoFacets 10.77 (3.3%) 10.77 (3.6%) -0.0% ( -6% - 7%) 0.993
HighTermTitleBDVSort 222.22 (12.5%) 222.20 (11.7%) -0.0% ( -21% - 27%) 0.998
LowSloppyPhrase 34.64 (3.6%) 34.64 (3.4%) -0.0% ( -6% - 7%) 0.994
IntNRQ 143.05 (0.6%) 143.25 (0.8%) 0.1% ( -1% - 1%) 0.546
BrowseMonthSSDVFacets 30.95 (5.2%) 31.00 (5.0%) 0.2% ( -9% - 10%) 0.922
Respell 66.20 (2.2%) 66.36 (1.8%) 0.2% ( -3% - 4%) 0.719
AndHighMed 300.10 (2.5%) 300.83 (2.9%) 0.2% ( -5% - 5%) 0.775
LowPhrase 54.92 (2.6%) 55.07 (2.1%) 0.3% ( -4% - 5%) 0.701
MedTerm 1522.58 (3.9%) 1529.48 (3.8%) 0.5% ( -7% - 8%) 0.711
OrNotHighLow 965.50 (5.2%) 972.89 (2.8%) 0.8% ( -6% - 9%) 0.564
HighPhrase 163.18 (2.5%) 164.66 (2.6%) 0.9% ( -4% - 6%) 0.259
HighTerm 1400.74 (3.8%) 1417.05 (3.9%) 1.2% ( -6% - 9%) 0.335
MedSloppyPhrase 146.41 (2.6%) 148.13 (2.8%) 1.2% ( -4% - 6%) 0.171
Prefix3 355.38 (3.1%) 359.62 (3.4%) 1.2% ( -5% - 7%) 0.242
OrNotHighHigh 704.32 (4.2%) 713.26 (4.6%) 1.3% ( -7% - 10%) 0.363
TermDTSort 227.11 (13.3%) 230.00 (13.5%) 1.3% ( -22% - 32%) 0.764
OrHighNotMed 724.29 (4.2%) 736.47 (3.8%) 1.7% ( -6% - 10%) 0.183
HighTermMonthSort 134.94 (10.6%) 137.37 (10.2%) 1.8% ( -17% - 25%) 0.583
OrNotHighMed 764.10 (3.7%) 778.89 (3.4%) 1.9% ( -4% - 9%) 0.082
Adrien Grand (@jpountz) (migrated from JIRA)
I made the following changes, and actually still saw varying benchmark results across runs (randomized queries?)
Indeed the benchmark randomly picks queries in the tasks file.
Changes in benchUtil.py to not verify counts
Actually you should be able to do it without modifying the benchmarking code, by configuring your Competition object to not verify counts like this in your localrun file: comp = competition.Competition(verifyCounts=False)
When I run luceneutil, I see further errors from verifyScores section of code, which may indicate bugs in my changes:
Indeed this indicates that the query returns different top hits with your change. If the change were on the order of one ulp, it could be due to the fact that the sum may depend on the order in which the clauses' scores are summed up, but given the significant score difference, there must be a bigger problem. Have you run tests with this change? That could help figure out where the bug is.
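The ulp point is worth illustrating: floating-point addition is not associative, so summing the same clause scores in a different order can legitimately change the result by a few ulps. A standalone example:

```java
// Demonstrates that float addition is order-dependent: reordering a sum
// can change the result, which is why tiny score differences between two
// disjunction scorers are expected even when both are correct.
public class FloatSumOrder {
  public static void main(String[] args) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float leftToRight = (a + b) + c; // a + b cancels exactly, then c survives: 1.0
    float rightToLeft = a + (b + c); // c is absorbed into b (ulp at 1e8 is 8), then cancels: 0.0
    System.out.println(leftToRight == rightToLeft); // false
  }
}
```

Differences of this magnitude are an extreme case; score mismatches far larger than an ulp or two, as seen above, point to a real bug rather than summation order.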
Zach Chen (@zacharymorn) (migrated from JIRA)
Actually you should be able to do it without modifying the benchmarking code, by configuring your Competition object to not verify counts like this in your localrun file: comp = competition.Competition(verifyCounts=False)
Ah I see. Thanks for the tip, will use that going forward!
Indeed this indicates that the query returns different top hits with your change. If the change was in the order of one ulp, then this could be due to the fact that the sum might depend on the order in which clauses' scores are summed up, but given the significant score difference, there must be a bigger problem. Have you run tests with this change? This could help figure out where the bug is.
Yes, ./gradlew check was passing before, but I saw your comment in the PR and that calculation was indeed incorrect. Let me correct it and try again.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I took a stab at implementing BMM and published a new PR here for further discussion: https://github.com/apache/lucene/pull/101 . I'm pretty happy about being able to implement a new scorer, even though its performance is a bit poor (although it seems to be on par with the experimental results published in http://engineering.nyu.edu/~suel/papers/bmm.pdf for the BMM and BMW comparison on 2-clause OR queries). Shall we consider adding a benchmark query set with 5+ clauses to see the performance comparison, as that seems to be when BMM may outperform BMW, as the paper suggested?
Adrien Grand (@jpountz) (migrated from JIRA)
I'd be interested in seeing what the results look like with 5+ clauses indeed.
And that also makes me curious about whether we could do better with a BulkScorer. Currently your PR needs to put a lot of code in doAdvance to reason about max scores, check whether we need to move scorers from the list of essential scorers to the list of optional scorers, etc. I think this code has non-negligible overhead since it's called on every match. A BulkScorer would make it easier to only do this sort of thing on periodic checkpoints. This is the trick that BooleanScorer uses: in order to not have to keep a heap constantly ordered, it scores windows of 2048 documents at a time.
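The windowing trick can be sketched as follows. This is a hypothetical skeleton (made-up names, not BooleanScorer's actual code) showing where the once-per-window bookkeeping would go, so that expensive decisions happen per window rather than per match:

```java
// Skeleton of window-based bulk scoring in the style BooleanScorer uses
// (hypothetical sketch, not actual Lucene code).
public class WindowedScoringSketch {
  static final int WINDOW_SIZE = 2048;

  public interface DocScorer {
    /** Collect matches in [windowMin, windowMax); returns number of docs scored. */
    int scoreWindow(int windowMin, int windowMax);
  }

  /**
   * Scores the doc-id range [min, max) one fixed-size window at a time.
   * Per-window setup, such as re-partitioning scorers into essential and
   * non-essential lists based on block max scores, would go at the top of
   * the loop instead of inside the per-document hot path.
   */
  public static int scoreRange(DocScorer scorer, int min, int max) {
    int scored = 0;
    for (int windowMin = min; windowMin < max; windowMin += WINDOW_SIZE) {
      int windowMax = Math.min(windowMin + WINDOW_SIZE, max);
      // once-per-window bookkeeping happens here, not once per match
      scored += scorer.scoreWindow(windowMin, windowMax);
    }
    return scored;
  }

  public static void main(String[] args) {
    // Trivial scorer that "matches" every doc, just to show the window arithmetic.
    int total = scoreRange((lo, hi) -> hi - lo, 0, 5000);
    System.out.println(total); // 5000: three windows of 2048, 2048 and 904 docs
  }
}
```

The trade-off is staleness: decisions made at a window boundary (e.g. which clauses are essential) may be slightly suboptimal inside the window, in exchange for removing that work from the per-document loop.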
Zach Chen (@zacharymorn) (migrated from JIRA)
Makes sense. I guess the general strategy then would be to implement BMM in the BulkScorer, and do the maxScore initialization and essential / non-essential list partitioning once per window, valid only within that 2048-document boundary. I'll give that a try!
Zach Chen (@zacharymorn) (migrated from JIRA)
I've implemented the above strategy and opened a new PR for it: https://github.com/apache/lucene/pull/113. I was using a BulkScorer on top of a collection of Scorers though, instead of a BulkScorer on top of a collection of BulkScorers like BooleanScorer, and hope the difference is due to the algorithms rather than me misunderstanding the intended usage of the BulkScorer interface :D . The result from the benchmark util still shows it's slower than WANDScorer for 2-clause queries, especially for the OrHighHigh task.
During the implementation of this BulkScorer I also realized there were some issues with the other PR I published earlier, so I'll fix them next and see if that will give us better result.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I've done another pass and fixed a few issues in https://github.com/apache/lucene/pull/101. I tried some other optimizations as well (such as moving a scorer from the essential to the non-essential list every time minCompetitiveScore gets updated), but they didn't seem to improve the benchmark results much for pure disjunction queries in either implementation. Assuming there's no major miss or bug in the two implementations so far, I also feel that, compared with BMW, the main bottleneck in BMM for the 2-clause OR queries run by the benchmark is indeed the additional frequent work performed to check and align on the max score boundary.
What do you think? Do you have any suggestion where I should look next?
Zach Chen (@zacharymorn) (migrated from JIRA)
I was trying to modify the CreateQueries class in luceneutil to generate OR queries with 5 clauses, but ran into some issues running it. So I did a quick hack to combine the queries from OrHighHigh, OrHighMed and OrHighLow to create a new OrHighHighMedHighLow task. I've attached the resulting file wikimedium.10M.nostopwords.tasks to this ticket.
Here are the luceneutil results from 2 runs for each implementation:
Scorer https://github.com/apache/lucene/pull/101
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 30.97 (6.2%) 24.92 (4.4%) -19.5% ( -28% - -9%) 0.000
PKLookup 223.53 (2.4%) 228.10 (3.7%) 2.0% ( -3% - 8%) 0.037
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 32.83 (3.4%) 34.00 (5.1%) 3.6% ( -4% - 12%) 0.009
PKLookup 217.86 (2.8%) 228.14 (4.2%) 4.7% ( -2% - 12%) 0.000
BulkScorer https://github.com/apache/lucene/pull/113
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 197.84 (4.1%) 207.79 (4.2%) 5.0% ( -3% - 13%) 0.000
OrHighHighMedHighLow 32.50 (16.7%) 35.79 (9.9%) 10.1% ( -14% - 44%) 0.020
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 28.61 (5.4%) 22.28 (4.2%) -22.1% ( -30% - -13%) 0.000
PKLookup 227.38 (2.6%) 233.05 (2.7%) 2.5% ( -2% - 8%) 0.003
Adrien Grand (@jpountz) (migrated from JIRA)
Thanks for writing two scorers to test this out! Would you be able to run queries under a profiler to see where your new scorers are spending most time? This might help identify how we could make them faster.
Also thanks for testing with more queries; FWIW it would be good enough to add only 4-5 new queries to the tasks file to play with the change. By the way, I'd be curious to see how your new scorers perform with 5 "Med" terms, which should be a worst-case scenario for BMW since all terms should have similar max scores. Since the queries you ran have a "Low" term, I wonder whether that term drives iteration, which would prevent BMM from showing the lower overhead it has compared to BMW.
Zach Chen (@zacharymorn) (migrated from JIRA)
No problem! Writing these scorers has actually been a great exercise for me to understand more about the scoring-related APIs and benchmark testing. I have enjoyed it a lot!
For the profiling, are you referring to JFR? It is currently enabled by default in luceneutil, and I've added the results below from the 5 "Med" term queries (queries file wikimedium.10M.nostopwords.tasks.5OrMeds attached):
BMM Scorer Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 40.66 (9.7%) 32.77 (7.7%) -19.4% ( -33% - -2%) 0.000
PKLookup 215.12 (1.5%) 221.56 (1.8%) 3.0% ( 0% - 6%) 0.000
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 12153 events (total: 12153)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.24% 515 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()
4.22% 513 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#updateUpToAndMaxScore()
3.11% 378 java.util.LinkedList#listIterator()
2.53% 307 java.util.LinkedList$ListItr#next()
2.42% 294 java.util.zip.Inflater#inflateBytesBytes()
2.15% 261 org.apache.lucene.search.DisiPriorityQueue#pop()
1.60% 195 jdk.internal.misc.Unsafe#getByte()
1.47% 179 org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()
1.41% 171 java.util.AbstractList$SubList#listIterator()
1.36% 165 java.util.AbstractList#listIterator()
1.31% 159 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.22% 148 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.21% 147 org.apache.lucene.search.DisiPriorityQueue#add()
1.20% 146 java.util.LinkedList$ListItr#<init>()
1.15% 140 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.11% 135 java.util.LinkedList$ListItr#checkForComodification()
1.08% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.04% 126 java.nio.DirectByteBuffer#get()
1.00% 122 java.lang.Object#wait()
1.00% 122 org.apache.lucene.search.DisiPriorityQueue#downHeap()
0.95% 116 java.util.AbstractList$Itr#<init>()
0.82% 100 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.81% 98 org.apache.lucene.store.ByteBufferGuard#getByte()
0.80% 97 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.73% 89 sun.nio.fs.UnixNativeDispatcher#open0()
0.73% 89 java.lang.ClassLoader#defineClass1()
0.72% 87 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.69% 84 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.63% 77 jdk.internal.util.ArraysSupport#mismatch()
0.63% 76 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#repartitionLists()
CPU merged search profile for baseline:
PROFILE SUMMARY from 9671 events (total: 9671)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.96% 286 java.util.zip.Inflater#inflateBytesBytes()
1.78% 172 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.73% 167 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.72% 166 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.57% 152 java.lang.Object#wait()
1.57% 152 org.apache.lucene.search.DisiPriorityQueue#add()
1.51% 146 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.43% 138 java.nio.DirectByteBuffer#get()
1.35% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.15% 111 java.io.RandomAccessFile#readBytes()
1.10% 106 java.lang.ClassLoader#defineClass1()
1.07% 103 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.01% 98 org.apache.lucene.store.ByteBufferGuard#getByte()
1.01% 98 org.apache.lucene.search.WANDScorer#addTail()
1.00% 97 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.00% 97 java.io.UnixFileSystem#getBooleanAttributes0()
0.99% 96 sun.nio.fs.UnixNativeDispatcher#open0()
0.93% 90 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.90% 87 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.90% 87 java.util.Arrays#rangeCheck()
0.88% 85 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.86% 83 org.apache.lucene.search.WANDScorer#advanceTail()
0.84% 81 jdk.internal.util.ArraysSupport#mismatch()
0.80% 77 org.apache.lucene.search.WANDScorer#popTail()
0.78% 75 jdk.internal.misc.Unsafe#getByte()
0.74% 72 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.71% 69 org.apache.lucene.search.DisiPriorityQueue#pop()
0.71% 69 org.apache.lucene.search.WANDScorer$2#matches()
0.65% 63 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#scanToTermLeaf()
0.64% 62 org.apache.lucene.search.DisiPriorityQueue#top()
—
BMM Scorer Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 36.68 (10.0%) 22.26 (10.2%) -39.3% ( -54% - -21%) 0.000
PKLookup 217.96 (9.2%) 221.95 (9.5%) 1.8% ( -15% - 22%) 0.537
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 15355 events (total: 15355)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.04% 621 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()
3.30% 507 java.util.AbstractList#subList()
3.15% 484 java.util.LinkedList#listIterator()
2.40% 369 java.util.AbstractList$SubList#listIterator()
2.38% 365 org.apache.lucene.search.DisiPriorityQueue#upHeap()
2.29% 351 java.util.zip.Inflater#inflateBytesBytes()
2.10% 323 java.util.Arrays#asList()
1.95% 300 java.util.AbstractList#listIterator()
1.83% 281 jdk.internal.misc.Unsafe#getByte()
1.75% 269 org.apache.lucene.search.WANDScorer#scaleMaxScore()
1.53% 235 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.45% 222 java.util.LinkedList$ListItr#next()
1.39% 213 org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()
1.35% 208 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.32% 202 java.util.AbstractList$Itr#<init>()
1.15% 176 org.apache.lucene.search.DisiPriorityQueue#pop()
1.13% 174 java.nio.DirectByteBuffer#get()
1.04% 159 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.03% 158 org.apache.lucene.search.LeafSimScorer#getNormValue()
1.03% 158 java.util.AbstractList$SubList$1#nextIndex()
0.96% 148 org.apache.lucene.search.DisiPriorityQueue#top()
0.96% 147 java.util.AbstractList$Itr#next()
0.94% 145 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
0.83% 127 java.lang.Object#wait()
0.79% 122 org.apache.lucene.store.ByteBufferGuard#getByte()
0.69% 106 org.apache.lucene.store.DataInput#readVInt()
0.66% 102 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.64% 98 java.lang.Math#scalb()
0.64% 98 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.63% 96 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
CPU merged search profile for baseline:
PROFILE SUMMARY from 11671 events (total: 11671)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.96% 345 java.util.zip.Inflater#inflateBytesBytes()
2.56% 299 java.io.RandomAccessFile#readBytes()
2.29% 267 org.apache.lucene.search.DisiPriorityQueue#add()
2.04% 238 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
2.01% 235 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.97% 230 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.91% 223 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.66% 194 java.nio.DirectByteBuffer#get()
1.48% 173 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.31% 153 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.29% 151 org.apache.lucene.search.WANDScorer#addTail()
1.24% 145 java.lang.Object#wait()
1.23% 143 org.apache.lucene.search.DisiPriorityQueue#pop()
1.17% 137 org.apache.lucene.search.WANDScorer#popTail()
1.11% 129 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.04% 121 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
1.01% 118 org.apache.lucene.search.WANDScorer$2#matches()
0.94% 110 org.apache.lucene.search.DisiPriorityQueue#top()
0.91% 106 sun.nio.fs.UnixNativeDispatcher#open0()
0.90% 105 java.io.FileInputStream#readBytes()
0.90% 105 java.lang.ClassLoader#defineClass1()
0.90% 105 org.apache.lucene.store.DataInput#readVInt()
0.90% 105 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.83% 97 org.apache.lucene.store.ByteBufferGuard#getByte()
0.80% 93 org.apache.lucene.search.WANDScorer#advanceTail()
0.75% 88 jdk.internal.misc.Unsafe#getByte()
0.72% 84 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.68% 79 perf.PKLookupTask#go()
0.68% 79 java.io.UnixFileSystem#getBooleanAttributes0()
0.65% 76 org.apache.lucene.search.WANDScorer#upHeapMaxScore()
----
BMM BulkScorer Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 34.58 (16.9%) 28.75 (9.0%) -16.9% ( -36% - 10%) 0.000
PKLookup 223.50 (2.6%) 229.67 (2.6%) 2.8% ( -2% - 8%) 0.001
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 12720 events (total: 12720)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.18% 405 java.util.LinkedList#listIterator()
2.62% 333 java.util.zip.Inflater#inflateBytesBytes()
2.43% 309 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$1#doAdvance()
2.21% 281 java.util.AbstractList#subList()
2.16% 275 org.apache.lucene.search.WANDScorer#scaleMaxScore()
1.96% 249 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$2#matches()
1.94% 247 jdk.internal.misc.Unsafe#getByte()
1.71% 217 java.util.Arrays#asList()
1.68% 214 java.util.AbstractList#listIterator()
1.66% 211 java.util.AbstractList$SubList#listIterator()
1.63% 207 java.util.LinkedList$ListItr#next()
1.37% 174 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.26% 160 java.lang.Object#wait()
1.16% 148 org.apache.lucene.search.LeafSimScorer#getNormValue()
1.12% 143 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.04% 132 java.nio.DirectByteBuffer#get()
1.03% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.03% 131 java.util.AbstractList$Itr#<init>()
0.96% 122 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
0.91% 116 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer#updateIntervalBoundary()
0.90% 114 jdk.internal.util.ArraysSupport#mismatch()
0.89% 113 org.apache.lucene.search.DisiPriorityQueue#pop()
0.88% 112 java.util.AbstractList$SubList$1#nextIndex()
0.83% 105 org.apache.lucene.search.DisiPriorityQueue#downHeap()
0.74% 94 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.73% 93 org.apache.lucene.store.ByteBufferGuard#getByte()
0.73% 93 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.72% 91 sun.nio.fs.UnixNativeDispatcher#open0()
0.72% 91 org.apache.lucene.search.DisiPriorityQueue#add()
0.68% 86 org.apache.lucene.search.DisiPriorityQueue#top()
CPU merged search profile for baseline:
PROFILE SUMMARY from 9996 events (total: 9996)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.16% 316 java.util.zip.Inflater#inflateBytesBytes()
1.89% 189 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.74% 174 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.73% 173 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.65% 165 org.apache.lucene.search.DisiPriorityQueue#add()
1.63% 163 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.62% 162 java.io.RandomAccessFile#readBytes()
1.46% 146 java.lang.Object#wait()
1.41% 141 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.21% 121 org.apache.lucene.search.WANDScorer#addTail()
1.16% 116 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.11% 111 jdk.internal.util.ArraysSupport#mismatch()
1.04% 104 java.nio.DirectByteBuffer#get()
1.02% 102 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
1.00% 100 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.98% 98 org.apache.lucene.search.WANDScorer$2#matches()
0.94% 94 org.apache.lucene.search.DisiPriorityQueue#pop()
0.94% 94 org.apache.lucene.search.WANDScorer#popTail()
0.91% 91 org.apache.lucene.search.DisiPriorityQueue#top()
0.86% 86 org.apache.lucene.util.BytesRef#compareTo()
0.85% 85 org.apache.lucene.store.ByteBufferGuard#getByte()
0.82% 82 java.io.UnixFileSystem#getBooleanAttributes0()
0.79% 79 sun.nio.fs.UnixNativeDispatcher#open0()
0.78% 78 jdk.internal.misc.Unsafe#getByte()
0.77% 77 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.73% 73 java.lang.ClassLoader#defineClass1()
0.73% 73 org.apache.lucene.search.WANDScorer#advanceTail()
0.70% 70 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
0.69% 69 java.lang.invoke.InvokerBytecodeGenerator#emitStaticInvoke()
0.67% 67 perf.PKLookupTask#go()
—
BMM BulkScorer Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 38.91 (10.4%) 22.19 (4.1%) -43.0% ( -52% - -31%) 0.000
PKLookup 220.07 (5.4%) 226.90 (3.1%) 3.1% ( -5% - 12%) 0.026
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 15143 events (total: 15143)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.94% 596 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$1#doAdvance()
3.52% 533 java.util.AbstractList#subList()
3.20% 485 java.util.LinkedList#listIterator()
2.81% 426 java.util.AbstractList$SubList#listIterator()
2.72% 412 org.apache.lucene.search.WANDScorer#scaleMaxScore()
2.44% 369 java.util.AbstractList#listIterator()
2.37% 359 org.apache.lucene.search.DisiPriorityQueue#upHeap()
2.27% 343 java.util.Arrays#asList()
2.15% 326 java.util.zip.Inflater#inflateBytesBytes()
1.71% 259 jdk.internal.misc.Unsafe#getByte()
1.44% 218 java.util.AbstractList$Itr#<init>()
1.40% 212 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.35% 205 java.util.LinkedList$ListItr#next()
1.33% 202 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$2#matches()
1.31% 199 org.apache.lucene.search.DisiPriorityQueue#pop()
1.20% 182 java.nio.DirectByteBuffer#get()
1.10% 167 java.util.AbstractList$SubList$1#nextIndex()
1.07% 162 java.util.AbstractList$Itr#next()
0.99% 150 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
0.96% 146 org.apache.lucene.search.LeafSimScorer#getNormValue()
0.94% 143 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
0.92% 139 org.apache.lucene.search.DisiPriorityQueue#top()
0.90% 137 java.lang.Object#wait()
0.79% 119 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer#updateIntervalBoundary()
0.77% 117 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
0.71% 108 java.util.AbstractList$SubList$1#<init>()
0.71% 108 org.apache.lucene.store.ByteBufferGuard#getByte()
0.67% 101 sun.nio.fs.UnixNativeDispatcher#open0()
0.66% 100 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.61% 93 java.lang.ClassLoader#defineClass1()
CPU merged search profile for baseline:
PROFILE SUMMARY from 10813 events (total: 10813)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.67% 289 org.apache.lucene.search.DisiPriorityQueue#add()
2.57% 278 java.util.zip.Inflater#inflateBytesBytes()
2.23% 241 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.96% 212 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.91% 207 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.65% 178 java.lang.Object#wait()
1.49% 161 java.nio.DirectByteBuffer#get()
1.36% 147 org.apache.lucene.search.DisiPriorityQueue#top()
1.34% 145 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.29% 139 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.29% 139 org.apache.lucene.search.WANDScorer#addTail()
1.28% 138 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.17% 126 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.16% 125 org.apache.lucene.search.DisiPriorityQueue#pop()
1.15% 124 org.apache.lucene.search.WANDScorer$2#matches()
1.09% 118 org.apache.lucene.search.WANDScorer#popTail()
1.04% 112 org.apache.lucene.search.WANDScorer#advanceTail()
1.00% 108 org.apache.lucene.store.ByteBufferGuard#getByte()
0.96% 104 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.95% 103 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.92% 99 sun.nio.fs.UnixNativeDispatcher#open0()
0.89% 96 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.86% 93 java.lang.ClassLoader#defineClass1()
0.83% 90 jdk.internal.misc.Unsafe#getByte()
0.80% 86 java.util.Arrays#compareUnsigned()
0.79% 85 org.apache.lucene.search.WANDScorer#upHeapMaxScore()
0.76% 82 perf.PKLookupTask#go()
0.68% 74 org.apache.lucene.search.WANDScorer#pushBackLeads()
0.68% 74 java.io.RandomAccessFile#readBytes()
0.68% 74 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
I think what stands out from these comparisons is that the two BMM implementations spent a lot of CPU cycles checking and advancing to the next candidate doc, whereas WAND spent its time advancing at the postings-reader level. Maybe the BMM implementations did not propagate minScore correctly or effectively to the underlying scorers, so many more candidate docs were surfaced from TermScorer? Another possibility is that BMM may indeed require more operations than BMW, as suggested by the paper. I will check more and get back to this...
Zach Chen (@zacharymorn) (migrated from JIRA)
Just want to provide a quick summary of the latest progress of this issue. Currently there are 3 different BMM implementations from 2 PRs:
#1
@jpountz what do you think about the above results and the latest changes, and are there any other ideas we should try? From the current results it appears option 1 might be the one to go with. I can start productizing the changes and adding tests once we have settled on the implementation approach.
Adrien Grand (@jpountz) (migrated from JIRA)
For some reason it hurt Fuzzy1 & Fuzzy2 performance consistently by around 8%-13%, even though it wasn't used for those queries
Are you sure? I believe that fuzzy queries rewrite to boolean queries, so they would use your new block-max maxscore under the hood?
Zach Chen (@zacharymorn) (migrated from JIRA)
Are you sure? I believe that fuzzy queries rewrite to boolean queries, so they would use your new block-max maxscore under the hood?
Hmm, I verified that by throwing a runtime exception in the BMM BulkScorer's constructor and running only the Fuzzy1 & Fuzzy2 queries in the benchmark, which completed successfully. I suspect the slowdown may come from the checks to see whether BMM is applicable. Let me take a further look there.
Zach Chen (@zacharymorn) (migrated from JIRA)
I see why Fuzzy1 & Fuzzy2 did not trigger the BMM scorer / bulkScorer now. Those queries were rewritten into boolean queries with boosting (BoostQuery), but in the BMM eligibility check I had checked for TermQuery directly https://github.com/apache/lucene/pull/113/files#diff-d500c30048128831b0fe3c53d9bb74eed7d8063e81d33737b26dcd00bc7f1fd2R337 , hence the BMM scorer / bulkScorer were not invoked for them.
It's also likely that the looping in that check hurt performance for both implementations, as fuzzy queries can expand into queries with many subqueries (one instance I saw had 50), and the current logic goes through all of them.
Zach Chen (@zacharymorn) (migrated from JIRA)
I made some changes to the BulkScorer implementations to return false for BMM eligibility immediately when a non-term query was identified, and they improved the benchmark results for Fuzzy1 & Fuzzy2 a bit (https://github.com/apache/lucene/pull/113/commits/f4115f78be0833b65694ad6a0f9f4f32565091e7). However, it appears that Fuzzy1 & Fuzzy2 benchmark results vary more across runs and query sets than other tasks in general.
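The early-exit check described above can be sketched as follows. This is hypothetical standalone code (the Clause type and eligible method are stand-ins, not Lucene's BooleanWeight logic); the point is bailing out on the first disqualifying clause rather than scanning all of a fuzzy rewrite's ~50 subqueries:

```java
import java.util.List;

// Hypothetical sketch of an early-exit eligibility check: return false as soon as
// a disqualifying (non-term) clause is seen, instead of inspecting every clause.
class BmmEligibility {
    // Stand-in for a boolean clause; real code would inspect the query type.
    record Clause(boolean isTermQuery) {}

    static boolean eligible(List<Clause> clauses) {
        for (Clause c : clauses) {
            if (!c.isTermQuery()) {
                return false; // early exit: avoids scanning large fuzzy rewrites
            }
        }
        return clauses.size() >= 2; // BMM only makes sense for disjunctions
    }
}
```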
Zach Chen (@zacharymorn) (migrated from JIRA)
@jpountz what do you think about the results we got so far? If we are good with the trade-off and the performance improvement BMM has for OrHighHigh and OrHighMed queries, I can work on productizing the changes next.
Adrien Grand (@jpountz) (migrated from JIRA)
The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster?
I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers:
BMW
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
BMM - scorer
AVG: 3.0175025635E7
Median: 18845216
P75: 37770072
P90: 77785996
P95: 98711520
P99: 181530180
Collected AVG: 50089.09
Collected Median: 20322
Collected P75: 57955
Collected P90: 125052
Collected P95: 194696
Collected P99: 442811
BMM - bulk scorer
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM?
EDIT: I initially got wrong results for the BMM scorer because I had run with an old commit of the PR; this is now fixed.
Zach Chen (@zacharymorn) (migrated from JIRA)
The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster?
Ah not sure why I didn't think of running them through BMM earlier! I just gave them a run, and got the following results:
BMM Scorer
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 30.46 (24.7%) 17.63 (11.6%) -42.1% ( -62% - -7%) 0.000
Fuzzy2 21.61 (16.4%) 16.28 (12.0%) -24.7% ( -45% - 4%) 0.000
PKLookup 216.72 (4.1%) 215.63 (3.0%) -0.5% ( -7% - 6%) 0.654
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 30.58 (9.1%) 22.12 (6.4%) -27.7% ( -39% - -13%) 0.000
Fuzzy2 36.07 (12.7%) 27.05 (10.8%) -25.0% ( -42% - -1%) 0.000
PKLookup 215.26 (3.4%) 213.99 (2.5%) -0.6% ( -6% - 5%) 0.530
BMMBulkScorer without window (with the above scorer implementation)
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 16.32 (22.6%) 15.68 (16.3%) -3.9% ( -34% - 45%) 0.527
Fuzzy1 48.11 (17.6%) 47.48 (13.6%) -1.3% ( -27% - 36%) 0.791
PKLookup 213.67 (3.2%) 212.52 (4.0%) -0.5% ( -7% - 6%) 0.640
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 26.99 (23.2%) 24.75 (13.6%) -8.3% ( -36% - 37%) 0.169
PKLookup 216.27 (4.3%) 216.43 (3.4%) 0.1% ( -7% - 8%) 0.951
Fuzzy1 19.01 (24.2%) 20.01 (14.2%) 5.3% ( -26% - 57%) 0.400
BMMBulkScorer with window size 1024
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 23.56 (26.0%) 19.08 (13.9%) -19.0% ( -46% - 28%) 0.004
Fuzzy1 30.97 (31.6%) 25.82 (16.9%) -16.6% ( -49% - 46%) 0.038
PKLookup 213.23 (2.5%) 211.63 (1.8%) -0.7% ( -5% - 3%) 0.289
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 20.59 (12.1%) 20.59 (10.5%) -0.0% ( -20% - 25%) 0.994
PKLookup 205.21 (3.1%) 206.99 (3.7%) 0.9% ( -5% - 7%) 0.422
Fuzzy2 30.74 (22.7%) 32.71 (17.0%) 6.4% ( -27% - 59%) 0.311
These results actually look strange to me, as I would expect the windowless BulkScorer to perform similarly to the scorer, since it just uses the scorer implementation under the hood. I'll need to dig in more to understand what contributed to these differences (their JFR CPU recordings look similar too).
From the results I have now, it seems BMM may not be ideal for handling queries with many terms. My high-level guess is that with these queries, which can be rewritten into boolean queries with ~50 terms, BMM may spend a lot of time computing upTo and updating maxScore, since the minimum of all scorers' block boundaries is used to update upTo each time. This can explain why the bulkScorer implementation with a fixed window size performs better than the scorer one, but doesn't explain the difference above.
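Under that guess, the boundary bookkeeping could look roughly like this (a simplified, hypothetical sketch, not Lucene code): taking the minimum of all clauses' current block boundaries means that with more clauses the intervals get shorter, so the upTo and max-score updates run more often:

```java
// Hypothetical sketch: upTo is the minimum of each clause's current impact-block
// boundary. With ~50 clauses the minimum tends to be close, so intervals shrink
// and the per-interval max-score bookkeeping runs frequently.
class BoundaryUpdate {
    /** blockBoundaries[i] = last docID of scorer i's current impact block. */
    static int nextUpTo(int[] blockBoundaries) {
        int upTo = Integer.MAX_VALUE;
        for (int b : blockBoundaries) {
            upTo = Math.min(upTo, b); // smallest boundary bounds the interval
        }
        return upTo;
    }
}
```

A fixed-window bulk scorer sidesteps this by amortizing the update over a constant-size window instead of every block boundary.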
I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers:
Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM?
Thanks for running these additional tests! The results indeed look interesting. I took a look at the MSMarcoPassages.java code you attached, and wonder if it's also possible that, since the percentile numbers were computed after sorting, BMM does much better at some low percentiles (P10, for example) but worse than BMW for the rest (at least 50% of them)?
I also notice that the BMM BulkScorer collects roughly 10X as many docs as the BMM scorer, which in turn collects more than 10X as many docs as BMW. I feel this may also explain the unexpected slowdown? In general I would assume these scorers all collect the same number of top docs.
Also, I'm interested in running these benchmark tests myself. Are the passages dataset and queries you used available for download somewhere (I found the MS GitHub site, but I'm not sure it has the same version as the one you used)?
Adrien Grand (@jpountz) (migrated from JIRA)
I also notice that the BMM BulkScorer collects roughly 10X as many docs as the BMM scorer, which in turn collects more than 10X as many docs as BMW. I feel this may also explain the unexpected slowdown? In general I would assume these scorers all collect the same number of top docs.
Actually this matches my expectation. BMM and BMW differ in that BMM only makes a decision about which scorers lead iteration once per block, while BMW needs to make decisions on every document. So BMM collects more documents than BMW but BMW takes the risk that trying to be too smart makes things slower than a simpler approach.
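A back-of-the-envelope way to see this difference (illustrative code only, not Lucene's implementation): if BMW makes a lead/tail decision per candidate document while BMM makes one per block boundary, the decision counts diverge by roughly the block size:

```java
// Illustrative only: BMW re-decides which scorers lead on every candidate document,
// while BMM fixes the essential/non-essential split once per block. BMM therefore
// evaluates (and collects) more documents, but pays far less per document.
class DecisionCounts {
    static long bmwDecisions(long docsEvaluated) {
        return docsEvaluated; // one decision per candidate document
    }

    static long bmmDecisions(long docsEvaluated, int blockSize) {
        return (docsEvaluated + blockSize - 1) / blockSize; // one per block
    }
}
```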
Are these passages data set and queries used available for download somewhere
Yes. You can download the "Collection" and "Queries" files from https://microsoft.github.io/msmarco/#ranking (make sure to accept terms at the top first so that download links are active).
Zach Chen (@zacharymorn) (migrated from JIRA)
Actually this matches my expectation. BMM and BMW differ in that BMM only makes a decision about which scorers lead iteration once per block, while BMW needs to make decisions on every document. So BMM collects more documents than BMW but BMW takes the risk that trying to be too smart makes things slower than a simpler approach.
Ok I also took a further look at the TopDocsCollector code, and confirmed that I had an incorrect understanding of "collect" and "hit count" here earlier. This (and Michael's earlier response) totally makes sense now!
Yes. You can download the "Collection" and "Queries" files from https://microsoft.github.io/msmarco/#ranking (make sure to accept terms at the top first so that download links are active).
Thanks! I was able to download them. Will explore a bit more to see how they can be improved further.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I've tried out a few ideas in the last few days and they gave some improvements (though they also made OrMedMedMedMedMed worse). However, the result was still not as performant as BMW on the MSMARCO passages dataset. The ideas I tried include:
Collectively, these gave a 70%-90% performance boost to OrHighHigh, 60%-150% for OrHighMed, and smaller improvements for AndHighOrMedMed, but at the expense of OrMedMedMedMedMed performance (down about 40% with the #4 changes).
For the MSMARCO passages dataset, they now give the following results (modified slightly from your version to show more percentiles, and to add commas as digit separators for readability):
BMW Scorer
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894
BMM Scorer
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
I've also attached "JFR result for BMM scorer with optimizations May 22" for the BMM scorer profiling result from the latest changes. Overall, it seems the larger number of docs collected by BMM is becoming the performance bottleneck, as around 50% of the computation was spent in SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing scores for candidate docs (around 34% was spent finding the next doc in BlockMaxMaxscoreScorer#nextDoc). If there's a way to prune more docs faster, it should improve BMM further.
Implemented.
Lucene often gets benchmarked against other engines, e.g. against Tantivy and PISA at https://tantivy-search.github.io/bench/ or against research prototypes in Table 1 of https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf. Given that top-level disjunctions of term queries are commonly used for benchmarking, it would be nice to optimize this case a bit more. I suspect that we could make fewer per-document decisions by implementing a BulkScorer instead of a Scorer.
Migrated from LUCENE-9335 by Adrien Grand (@jpountz), updated May 23 2021 Attachments: JFR result for BMM scorer with optimizations May 22.png, MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, wikimedium.10M.nostopwords.tasks.5OrMeds Pull requests: https://github.com/apache/lucene/pull/81, https://github.com/apache/lucene/pull/101, https://github.com/apache/lucene/pull/113