Closed: asfimport closed this issue 9 months ago
Atri Sharma (@atris) (migrated from JIRA)
+1. I was looking at comparisons with Vespa recently and noticed something similar.
Are you planning to work on this? Can take this up if you would like.
Adrien Grand (@jpountz) (migrated from JIRA)
No I'm not, feel free to take it! For what it's worth, Block-Max MaxScore is often reported as being a bit faster despite the fact that it often evaluates more documents than Block-Max WAND, since it makes fewer per-document decisions, so it might be interesting to explore implementing it for this bulk scorer.
Zach Chen (@zacharymorn) (migrated from JIRA)
Just curious, I see code in https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L261-L288 and BooleanScorer (extends BulkScorer) are already available for optimized disjunction scoring. Is the goal here to leverage some ideas from Block-Max MaxScore to reduce the per-document decision in the code path above (particularly BooleanScorer) ?
Adrien Grand (@jpountz) (migrated from JIRA)
@zacharymorn Yes, that would be one idea. In the BMM paper (http://engineering.nyu.edu/~suel/papers/bmm.pdf) BMM is usually a bit slower than BMW, but not always. I'd be curious to know whether we observe the same result in Lucene.
Since we introduced BMW there have been a few reports that top-level disjunctions got slower. This is usually because there are many clauses in a disjunction that have about the same max score and BMW can hardly skip evaluating documents. In such cases we pay for the BMW overhead without enjoying any benefits. Because BMM has less overhead, I would expect it to perform better in these worst-case scenarios, so I wonder if we should look into using BMM for top-level disjunctions in general.
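The MaxScore idea behind this discussion can be sketched as follows. This is a simplified, hypothetical illustration of the essential / non-essential partitioning (class and method names are made up, not Lucene's actual API): scorers are sorted by max score, and the longest prefix whose cumulative max score cannot reach the minimum competitive score is "non-essential", so only the remaining clauses need to drive iteration.

```java
import java.util.Arrays;

// Simplified sketch of MaxScore's clause partitioning (hypothetical, not Lucene code).
// A document matching only non-essential clauses can never enter the top-k,
// because even their combined max scores stay at or below the current threshold.
public class MaxScorePartition {
  /**
   * Returns the index of the first "essential" clause after sorting clauses by
   * ascending max score. Clauses before that index are non-essential.
   */
  public static int firstEssentialIndex(double[] maxScores, double minCompetitiveScore) {
    double[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double cumulative = 0;
    for (int i = 0; i < sorted.length; i++) {
      cumulative += sorted[i];
      if (cumulative > minCompetitiveScore) {
        return i; // sorted[0..i-1] are non-essential, sorted[i..] are essential
      }
    }
    return sorted.length; // no clause is essential: no new doc can be competitive
  }

  public static void main(String[] args) {
    // Three clauses with max scores 1.0, 2.0 and 4.0, threshold 2.5:
    // 1.0 alone cannot reach 2.5, but 1.0 + 2.0 = 3.0 can, so only the
    // lowest-scoring clause is non-essential.
    System.out.println(firstEssentialIndex(new double[] {4.0, 1.0, 2.0}, 2.5));
  }
}
```

When many clauses have similar max scores, this prefix stays short and MaxScore (like WAND) can skip little, which is exactly the worst-case scenario described above.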
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, thanks for the additional paper! I gave it a read and found it to be a great summary of and refresher on the BMM & BMW algorithms, in addition to some other interesting ones! I also saw that the paper mentioned BMM was found to be faster than BMW when there are 5+ clauses in the disjunction, which echoes your observation above as well.
I explored this a bit in the code, and found one heuristic-based approach (using iterator cost similarity as a proxy for block max score similarity across scorers) that seems to give interesting results:
Code changes:
diff --git a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
index bdf085d4669..f71e2c2fdd8 100644
--- a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
+++ b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
@@ -238,11 +238,33 @@ final class Boolean2ScorerSupplier extends ScorerSupplier {
//
// However, as WANDScorer uses more complex algorithm and data structure, we would like to
// still use DisjunctionSumScorer to handle exhaustive pure disjunctions, which may be faster
- if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
+ boolean isPureDisjunction =
+ subs.get(Occur.FILTER).isEmpty()
+ && subs.get(Occur.MUST).isEmpty()
+ && subs.get(Occur.MUST_NOT).isEmpty();
+ // top-level boolean term query
+ boolean allTermScorers =
+ optionalScorers.stream().allMatch(scorer -> scorer instanceof TermScorer);
+
+ if (isPureDisjunction && allTermScorers && isSimilarCost(optionalScorers) && minShouldMatch <= 1) {
+ return new DisjunctionMaxScorer(weight, 0.01f, optionalScorers, scoreMode);
+ } else if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
return new WANDScorer(weight, optionalScorers, minShouldMatch, scoreMode);
} else {
return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
}
}
}
+
+ private boolean isSimilarCost(List<Scorer> optionalScorers) {
+ long minCost = Long.MAX_VALUE;
+ long maxCost = Long.MIN_VALUE;
+ for (Scorer scorer : optionalScorers) {
+ long cost = scorer.iterator().cost();
+ minCost = Math.min(minCost, cost);
+ maxCost = Math.max(maxCost, cost);
+ }
+
+ return maxCost / minCost < 2;
+ }
}
From this change, I have the following luceneutil benchmark result with source wikimedium5m, with searchBench.py also modified to not fail when hit counts differ between baseline and candidate:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
HighTermDayOfYearSort 264.97 (16.9%) 248.25 (9.4%) -6.3% ( -27% - 23%) 0.143
OrHighLow 477.81 (5.7%) 456.76 (6.3%) -4.4% ( -15% - 8%) 0.021
Fuzzy1 82.43 (12.9%) 78.99 (11.0%) -4.2% ( -24% - 22%) 0.271
HighTermTitleBDVSort 237.06 (10.7%) 227.86 (8.9%) -3.9% ( -21% - 17%) 0.214
OrNotHighMed 646.77 (4.1%) 631.45 (4.7%) -2.4% ( -10% - 6%) 0.091
HighTerm 1717.41 (5.4%) 1681.25 (5.4%) -2.1% ( -12% - 9%) 0.219
PKLookup 213.65 (4.0%) 210.55 (3.9%) -1.4% ( -9% - 6%) 0.247
MedTerm 1703.74 (4.0%) 1680.75 (7.7%) -1.3% ( -12% - 10%) 0.487
OrHighNotHigh 679.24 (7.0%) 670.77 (7.0%) -1.2% ( -14% - 13%) 0.572
HighTermMonthSort 111.16 (13.4%) 109.81 (13.5%) -1.2% ( -24% - 29%) 0.775
MedSpanNear 144.50 (2.0%) 142.97 (3.8%) -1.1% ( -6% - 4%) 0.268
OrHighNotLow 738.44 (6.6%) 730.70 (5.5%) -1.0% ( -12% - 11%) 0.586
LowPhrase 255.32 (1.7%) 252.66 (3.0%) -1.0% ( -5% - 3%) 0.169
TermDTSort 255.93 (15.2%) 253.43 (13.3%) -1.0% ( -25% - 32%) 0.829
AndHighMed 404.90 (2.4%) 401.82 (3.5%) -0.8% ( -6% - 5%) 0.423
BrowseMonthTaxoFacets 12.90 (2.9%) 12.80 (3.4%) -0.7% ( -6% - 5%) 0.469
LowSpanNear 85.22 (1.5%) 84.71 (2.0%) -0.6% ( -4% - 2%) 0.285
BrowseDateTaxoFacets 10.72 (2.7%) 10.66 (2.8%) -0.6% ( -5% - 5%) 0.523
HighSpanNear 15.64 (1.9%) 15.55 (2.4%) -0.5% ( -4% - 3%) 0.423
Wildcard 135.32 (2.8%) 134.61 (4.0%) -0.5% ( -7% - 6%) 0.631
MedPhrase 86.95 (1.6%) 86.51 (2.7%) -0.5% ( -4% - 3%) 0.470
AndHighLow 772.76 (3.4%) 769.05 (3.9%) -0.5% ( -7% - 7%) 0.680
LowTerm 1572.57 (4.3%) 1565.62 (4.7%) -0.4% ( -9% - 9%) 0.758
BrowseDayOfYearTaxoFacets 10.74 (2.8%) 10.71 (2.9%) -0.3% ( -5% - 5%) 0.713
OrNotHighLow 1212.21 (3.4%) 1208.45 (4.1%) -0.3% ( -7% - 7%) 0.797
Fuzzy2 52.79 (15.7%) 52.64 (11.6%) -0.3% ( -23% - 32%) 0.949
HighIntervalsOrdered 44.22 (1.3%) 44.12 (1.3%) -0.2% ( -2% - 2%) 0.571
HighPhrase 385.28 (2.7%) 384.37 (3.4%) -0.2% ( -6% - 6%) 0.809
Respell 94.82 (2.7%) 94.64 (3.3%) -0.2% ( -5% - 5%) 0.839
OrNotHighHigh 676.82 (5.0%) 675.66 (6.2%) -0.2% ( -10% - 11%) 0.923
MedSloppyPhrase 27.08 (2.9%) 27.07 (3.3%) -0.0% ( -6% - 6%) 0.980
AndHighHigh 44.82 (3.8%) 44.90 (4.7%) 0.2% ( -8% - 9%) 0.888
BrowseDayOfYearSSDVFacets 27.27 (2.8%) 27.33 (1.9%) 0.2% ( -4% - 4%) 0.757
HighSloppyPhrase 87.76 (2.5%) 88.26 (3.4%) 0.6% ( -5% - 6%) 0.542
Prefix3 123.34 (4.1%) 124.08 (3.7%) 0.6% ( -6% - 8%) 0.628
OrHighNotMed 746.91 (5.7%) 752.11 (7.0%) 0.7% ( -11% - 14%) 0.730
LowSloppyPhrase 329.41 (3.2%) 332.29 (2.3%) 0.9% ( -4% - 6%) 0.324
IntNRQ 304.68 (2.6%) 307.51 (2.1%) 0.9% ( -3% - 5%) 0.219
BrowseMonthSSDVFacets 30.50 (5.9%) 30.87 (2.6%) 1.2% ( -6% - 10%) 0.404
OrHighMed 128.85 (2.5%) 134.50 (8.8%) 4.4% ( -6% - 16%) 0.033
OrHighHigh 57.01 (2.5%) 237.95 (22.2%) 317.4% ( 285% - 351%) 0.000
WARNING: cat=OrHighHigh: hit counts differ: 15200+ vs 28513+
WARNING: cat=OrHighMed: hit counts differ: 3700+ vs 6567+
errors occurred: ([], ['query=body:de body:use filter=None sort=None groupField=None hitCount=15200+: wrong hitCount: 15200+ vs 28513+', 'query=body:27 body:throughout filter=None sort=None groupField=None hitCount=3700+: wrong hitCount: 3700+ vs 6567+'], 1.0)
I have a couple of observations / questions from the results:
Zach Chen (@zacharymorn) (migrated from JIRA)
@atris, are you currently working on this? Right now I'm just exploring this issue out of curiosity, and hopefully my exploration doesn't impact your ongoing work.
Michael McCandless (@mikemccand) (migrated from JIRA)
Not sure if this benchmark result is valid, given the hit count differs?
I think (not certain!) luceneutil verifies the top N hits are the same between candidate and baseline?
Since you are changing the BMW/BMM opto, I think it is expected that the (estimated) total hit count would change? In which case, it's fine to ignore that difference.
Zach Chen (@zacharymorn) (migrated from JIRA)
Thanks Michael for the comment! I went ahead and searched for the error message "wrong hitCount" from above in luceneutil, and found these
So I'm guessing that failure actually happened before the verification of top N hits as well as the id / score checks, and thus may mask further potential failures and abort the task early (which also explains why I got varying benchmark results across multiple recent runs)?
Adrien Grand (@jpountz) (migrated from JIRA)
DisjunctionMaxScorer
It should be DisjunctionSumScorer if we want to get the same scoring (DisjunctionSumScorer sums up scores of the clauses while DisjunctionMaxScorer takes the max).
Can you run the benchmark with verifyCounts=False like we do for nightlies so that it would only check top hits? https://github.com/mikemccand/luceneutil/blob/fbc9ae15a1bb47f7e15c95fb70f1bda57faccfc1/src/python/nightlyBench.py#L822
Atri Sharma (@atris) (migrated from JIRA)
@zacharymorn I will not be able to take this on for a while. Please feel free to take this on.
Zach Chen (@zacharymorn) (migrated from JIRA)
Thanks Atri for the confirmation!
Zach Chen (@zacharymorn) (migrated from JIRA)
DisjunctionMaxScorer
It should be DisjunctionSumScorer if we want to get the same scoring (DisjunctionSumScorer sums up scores of the clauses while DisjunctionMaxScorer takes the max).
Can you run the benchmark with verifyCounts=False like we do for nightlies so that it would only check top hits? https://github.com/mikemccand/luceneutil/blob/fbc9ae15a1bb47f7e15c95fb70f1bda57faccfc1/src/python/nightlyBench.py#L822
Ah, I was wondering about the nightly benchmark before, and I see now why Michael suggested the top N hit verification above. Let me give that a try. I didn't use DisjunctionSumScorer earlier as it seems to be missing some block-max-related method implementations, but I will use it to run the latest benchmark as well.
Zach Chen (@zacharymorn) (migrated from JIRA)
I made the following changes, and actually still saw varying benchmark results across runs (randomized queries?). I've listed them below:
Changes in Boolean2ScorerSupplier.java (use DisjunctionSumScorer instead of DisjunctionMaxScorer)
diff --git a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
index bdf085d4669..10478ab45bf 100644
--- a/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
+++ b/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java
@@ -238,11 +238,34 @@ final class Boolean2ScorerSupplier extends ScorerSupplier {
//
// However, as WANDScorer uses more complex algorithm and data structure, we would like to
// still use DisjunctionSumScorer to handle exhaustive pure disjunctions, which may be faster
- if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
+ boolean isPureDisjunction =
+ subs.get(Occur.FILTER).isEmpty()
+ && subs.get(Occur.MUST).isEmpty()
+ && subs.get(Occur.MUST_NOT).isEmpty();
+ // top-level boolean term query
+ boolean allTermScorers =
+ optionalScorers.stream().allMatch(scorer -> scorer instanceof TermScorer);
+
+ if (isPureDisjunction && allTermScorers && isSimilarCost(optionalScorers) && minShouldMatch <= 1) {
+ return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
+ } else if (scoreMode == ScoreMode.TOP_SCORES || minShouldMatch > 1) {
return new WANDScorer(weight, optionalScorers, minShouldMatch, scoreMode);
} else {
return new DisjunctionSumScorer(weight, optionalScorers, scoreMode);
}
}
}
+
+ private boolean isSimilarCost(List<Scorer> optionalScorers) {
+ long minCost = Long.MAX_VALUE;
+ long maxCost = Long.MIN_VALUE;
+ for (Scorer scorer : optionalScorers) {
+ long cost = scorer.iterator().cost();
+ minCost = Math.min(minCost, cost);
+ maxCost = Math.max(maxCost, cost);
+ }
+
+ // TODO heuristic based cost-similarity threshold
+ return maxCost / minCost < 2;
+ }
}
Changes in benchUtil.py to not verify counts
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..3579f45 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
cmpRawResults, heapCmp = parseResults(cmpLogFiles)
# make sure they got identical results
- cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+ cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, False)
baseResults = collateResults(baseRawResults)
cmpResults = collateResults(cmpRawResults)
Benchmark result 1 with source wikimedium5m:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHigh 87.27 (4.9%) 51.12 (1.7%) -41.4% ( -45% - -36%) 0.000
OrHighLow 624.55 (7.6%) 589.16 (8.1%) -5.7% ( -19% - 10%) 0.022
OrHighMed 135.02 (3.7%) 129.51 (6.6%) -4.1% ( -13% - 6%) 0.016
Wildcard 214.30 (3.3%) 209.33 (2.8%) -2.3% ( -8% - 3%) 0.017
OrNotHighHigh 728.60 (8.5%) 713.53 (6.3%) -2.1% ( -15% - 13%) 0.383
HighTerm 1195.98 (6.0%) 1174.51 (4.2%) -1.8% ( -11% - 8%) 0.273
LowTerm 1757.60 (6.0%) 1728.64 (4.8%) -1.6% ( -11% - 9%) 0.336
AndHighMed 231.78 (4.0%) 227.96 (3.7%) -1.6% ( -8% - 6%) 0.175
Prefix3 196.03 (3.4%) 193.19 (3.5%) -1.4% ( -8% - 5%) 0.180
Respell 59.52 (2.8%) 59.05 (2.7%) -0.8% ( -6% - 4%) 0.362
MedTerm 1507.60 (5.3%) 1495.89 (3.3%) -0.8% ( -8% - 8%) 0.580
BrowseDateTaxoFacets 11.04 (3.3%) 10.97 (2.8%) -0.7% ( -6% - 5%) 0.462
BrowseMonthTaxoFacets 13.21 (3.4%) 13.12 (3.9%) -0.7% ( -7% - 6%) 0.542
MedSloppyPhrase 67.05 (3.4%) 66.58 (4.0%) -0.7% ( -7% - 6%) 0.544
IntNRQ 215.89 (4.2%) 214.39 (2.9%) -0.7% ( -7% - 6%) 0.543
BrowseDayOfYearTaxoFacets 11.02 (3.2%) 10.96 (2.7%) -0.6% ( -6% - 5%) 0.546
LowPhrase 193.14 (4.0%) 192.05 (4.5%) -0.6% ( -8% - 8%) 0.678
BrowseDayOfYearSSDVFacets 27.80 (5.2%) 27.67 (5.5%) -0.5% ( -10% - 10%) 0.781
OrHighNotLow 823.92 (6.1%) 820.15 (4.8%) -0.5% ( -10% - 11%) 0.790
PKLookup 215.92 (3.9%) 215.02 (3.8%) -0.4% ( -7% - 7%) 0.734
Fuzzy1 65.82 (7.8%) 65.58 (11.0%) -0.4% ( -17% - 20%) 0.904
HighSloppyPhrase 42.05 (3.9%) 41.91 (3.3%) -0.3% ( -7% - 7%) 0.771
MedSpanNear 155.25 (3.5%) 154.78 (3.3%) -0.3% ( -6% - 6%) 0.779
AndHighHigh 84.97 (4.4%) 84.79 (3.0%) -0.2% ( -7% - 7%) 0.857
LowSloppyPhrase 100.76 (3.6%) 100.55 (3.6%) -0.2% ( -7% - 7%) 0.857
HighIntervalsOrdered 42.39 (3.4%) 42.34 (3.7%) -0.1% ( -6% - 7%) 0.921
HighTermDayOfYearSort 210.65 (15.0%) 210.79 (11.5%) 0.1% ( -23% - 31%) 0.987
HighPhrase 468.21 (4.5%) 468.66 (4.0%) 0.1% ( -7% - 8%) 0.943
HighSpanNear 148.68 (3.6%) 148.94 (3.6%) 0.2% ( -6% - 7%) 0.880
OrNotHighMed 682.83 (7.2%) 684.51 (4.4%) 0.2% ( -10% - 12%) 0.896
OrHighNotMed 733.07 (7.3%) 736.52 (4.9%) 0.5% ( -10% - 13%) 0.811
OrHighNotHigh 638.40 (7.2%) 642.31 (4.5%) 0.6% ( -10% - 13%) 0.747
BrowseMonthSSDVFacets 31.42 (5.2%) 31.65 (2.6%) 0.7% ( -6% - 8%) 0.577
AndHighLow 923.45 (5.9%) 933.64 (5.3%) 1.1% ( -9% - 13%) 0.534
MedPhrase 347.33 (5.7%) 351.57 (3.3%) 1.2% ( -7% - 10%) 0.404
LowSpanNear 311.32 (6.2%) 315.13 (4.2%) 1.2% ( -8% - 12%) 0.466
Fuzzy2 59.05 (12.3%) 60.19 (10.6%) 1.9% ( -18% - 28%) 0.594
OrNotHighLow 851.87 (6.4%) 869.95 (5.6%) 2.1% ( -9% - 15%) 0.263
HighTermMonthSort 99.63 (12.7%) 102.07 (15.1%) 2.5% ( -22% - 34%) 0.578
HighTermTitleBDVSort 161.36 (16.9%) 165.53 (18.3%) 2.6% ( -27% - 45%) 0.643
TermDTSort 195.70 (11.5%) 201.16 (10.4%) 2.8% ( -17% - 27%) 0.420
WARNING: cat=OrHighHigh: hit counts differ: 13070+ vs 357939+
Benchmark result 2 with source wikimedium5m:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 69.00 (13.7%) 64.57 (14.5%) -6.4% ( -30% - 25%) 0.150
Fuzzy2 50.00 (15.2%) 47.36 (17.9%) -5.3% ( -33% - 32%) 0.313
OrHighNotLow 780.22 (5.3%) 764.22 (3.9%) -2.1% ( -10% - 7%) 0.165
OrHighLow 235.87 (3.5%) 232.57 (3.2%) -1.4% ( -7% - 5%) 0.187
OrHighNotMed 741.20 (4.0%) 733.20 (4.9%) -1.1% ( -9% - 8%) 0.450
OrNotHighHigh 850.77 (5.9%) 842.95 (5.3%) -0.9% ( -11% - 10%) 0.606
IntNRQ 159.89 (1.5%) 158.54 (3.1%) -0.8% ( -5% - 3%) 0.270
OrHighMed 161.49 (5.0%) 160.24 (6.0%) -0.8% ( -11% - 10%) 0.659
MedTerm 1534.83 (4.9%) 1524.53 (5.7%) -0.7% ( -10% - 10%) 0.691
LowTerm 1854.25 (6.1%) 1842.02 (5.7%) -0.7% ( -11% - 11%) 0.725
Respell 67.40 (1.8%) 66.96 (2.5%) -0.7% ( -4% - 3%) 0.346
Wildcard 222.44 (2.5%) 221.67 (2.3%) -0.3% ( -5% - 4%) 0.645
Prefix3 213.45 (3.6%) 212.73 (3.9%) -0.3% ( -7% - 7%) 0.776
OrHighHigh 62.29 (2.8%) 62.08 (2.5%) -0.3% ( -5% - 5%) 0.700
BrowseDayOfYearSSDVFacets 27.76 (7.0%) 27.69 (6.9%) -0.3% ( -13% - 14%) 0.897
HighSpanNear 108.87 (2.3%) 108.63 (3.8%) -0.2% ( -6% - 6%) 0.820
BrowseMonthTaxoFacets 13.29 (2.3%) 13.30 (1.9%) 0.0% ( -4% - 4%) 0.940
LowSloppyPhrase 171.64 (2.8%) 171.88 (2.9%) 0.1% ( -5% - 6%) 0.875
PKLookup 217.98 (3.9%) 218.52 (3.6%) 0.2% ( -7% - 8%) 0.836
HighTerm 1392.99 (4.5%) 1396.79 (4.5%) 0.3% ( -8% - 9%) 0.847
HighIntervalsOrdered 29.54 (2.7%) 29.63 (2.6%) 0.3% ( -4% - 5%) 0.732
AndHighLow 797.99 (4.0%) 800.55 (3.7%) 0.3% ( -7% - 8%) 0.792
OrNotHighLow 885.90 (4.3%) 888.91 (5.4%) 0.3% ( -8% - 10%) 0.826
MedSloppyPhrase 65.81 (2.9%) 66.06 (3.1%) 0.4% ( -5% - 6%) 0.689
LowSpanNear 59.31 (2.6%) 59.55 (2.5%) 0.4% ( -4% - 5%) 0.609
BrowseDateTaxoFacets 11.19 (2.8%) 11.25 (2.8%) 0.5% ( -5% - 6%) 0.542
LowPhrase 170.61 (2.6%) 171.55 (2.6%) 0.6% ( -4% - 5%) 0.502
MedSpanNear 192.13 (3.2%) 193.32 (2.2%) 0.6% ( -4% - 6%) 0.469
BrowseDayOfYearTaxoFacets 11.20 (2.9%) 11.28 (2.9%) 0.7% ( -4% - 6%) 0.460
HighSloppyPhrase 82.88 (5.4%) 83.47 (5.5%) 0.7% ( -9% - 12%) 0.681
BrowseMonthSSDVFacets 31.59 (4.7%) 31.91 (2.0%) 1.0% ( -5% - 8%) 0.387
MedPhrase 138.53 (2.1%) 140.00 (2.7%) 1.1% ( -3% - 5%) 0.164
HighPhrase 294.46 (2.7%) 297.99 (2.3%) 1.2% ( -3% - 6%) 0.135
OrHighNotHigh 654.84 (5.4%) 663.25 (4.8%) 1.3% ( -8% - 12%) 0.427
HighTermMonthSort 175.68 (10.6%) 178.02 (11.3%) 1.3% ( -18% - 25%) 0.700
HighTermDayOfYearSort 301.64 (16.6%) 306.26 (17.4%) 1.5% ( -27% - 42%) 0.776
OrNotHighMed 694.25 (6.0%) 705.04 (5.6%) 1.6% ( -9% - 13%) 0.396
AndHighHigh 105.76 (2.4%) 107.44 (3.4%) 1.6% ( -4% - 7%) 0.087
TermDTSort 227.39 (12.5%) 232.52 (11.9%) 2.3% ( -19% - 30%) 0.559
AndHighMed 241.36 (2.9%) 246.82 (3.5%) 2.3% ( -4% - 8%) 0.026
HighTermTitleBDVSort 345.18 (13.3%) 361.51 (14.7%) 4.7% ( -20% - 37%) 0.286
Zach Chen (@zacharymorn) (migrated from JIRA)
I made some further changes to move some block max related logic from DisjunctionMaxScorer to DisjunctionScorer, so that DisjunctionSumScorer can inherit. I've published a WIP PR https://github.com/apache/lucene/pull/81 for those changes for the ease of review.
When I run luceneutil, I see further errors from the verifyScores section of the code, which may indicate bugs in my changes:
WARNING: cat=OrHighHigh: hit counts differ: 9870+ vs 2616+
Traceback (most recent call last):
File "src/python/localrun.py", line 53, in <module>
comp.benchmark("baseline_vs_patch")
File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/competition.py", line 455, in benchmark
randomSeed = self.randomSeed)
File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/searchBench.py", line 196, in run
raise RuntimeError('errors occurred: %s' % str(cmpDiffs))
RuntimeError: errors occurred: ([], ["query=body:second body:short filter=None sort=None groupField=None hitCount=9870+: hit 0 has wrong field/score value ([1444649], '5.0718417') vs ([5125], '4.224689')"], 1.0)
I then made further changes in benchUtil.py to skip over verifyScores, so that I can see what benchmark results it would generate:
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..c2faffc 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
cmpRawResults, heapCmp = parseResults(cmpLogFiles)
# make sure they got identical results
- cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+ cmpDiffs = compareHits(baseRawResults, cmpRawResults, False, False)
baseResults = collateResults(baseRawResults)
cmpResults = collateResults(cmpRawResults)
I then got the following benchmark results from multiple runs:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighMed 186.44 (2.8%) 160.50 (4.5%) -13.9% ( -20% - -6%) 0.000
OrHighLow 735.70 (7.5%) 696.89 (4.3%) -5.3% ( -15% - 6%) 0.006
Fuzzy1 75.85 (11.5%) 72.81 (14.0%) -4.0% ( -26% - 24%) 0.323
TermDTSort 237.49 (10.4%) 228.02 (10.6%) -4.0% ( -22% - 18%) 0.230
HighTermMonthSort 280.82 (9.8%) 274.90 (10.8%) -2.1% ( -20% - 20%) 0.518
Fuzzy2 54.08 (12.5%) 53.04 (14.2%) -1.9% ( -25% - 28%) 0.648
OrNotHighMed 672.83 (2.7%) 661.16 (4.7%) -1.7% ( -8% - 5%) 0.153
HighTermTitleBDVSort 438.56 (14.4%) 431.81 (16.6%) -1.5% ( -28% - 34%) 0.754
AndHighLow 969.43 (5.2%) 957.49 (4.7%) -1.2% ( -10% - 9%) 0.432
OrNotHighHigh 704.98 (3.4%) 700.72 (3.9%) -0.6% ( -7% - 7%) 0.605
AndHighHigh 109.77 (4.2%) 109.31 (4.7%) -0.4% ( -9% - 8%) 0.767
BrowseMonthSSDVFacets 32.52 (2.1%) 32.40 (4.6%) -0.4% ( -6% - 6%) 0.755
PKLookup 219.90 (3.1%) 219.16 (3.2%) -0.3% ( -6% - 6%) 0.734
Wildcard 284.84 (1.9%) 284.18 (1.8%) -0.2% ( -3% - 3%) 0.690
Prefix3 361.00 (2.1%) 360.24 (2.0%) -0.2% ( -4% - 4%) 0.750
HighIntervalsOrdered 28.68 (2.2%) 28.64 (1.7%) -0.1% ( -3% - 3%) 0.819
BrowseMonthTaxoFacets 13.60 (2.9%) 13.59 (2.7%) -0.1% ( -5% - 5%) 0.947
BrowseDayOfYearSSDVFacets 28.67 (4.8%) 28.66 (4.8%) -0.0% ( -9% - 10%) 0.979
HighSpanNear 79.29 (2.4%) 79.29 (2.2%) 0.0% ( -4% - 4%) 0.997
OrHighNotHigh 695.37 (5.5%) 696.65 (3.8%) 0.2% ( -8% - 10%) 0.903
MedTerm 1478.47 (3.6%) 1481.54 (3.0%) 0.2% ( -6% - 7%) 0.843
HighTermDayOfYearSort 372.12 (14.1%) 373.08 (14.8%) 0.3% ( -25% - 33%) 0.955
IntNRQ 125.36 (1.3%) 125.72 (0.7%) 0.3% ( -1% - 2%) 0.391
LowSpanNear 52.82 (1.7%) 52.98 (2.0%) 0.3% ( -3% - 4%) 0.611
BrowseDayOfYearTaxoFacets 11.28 (3.1%) 11.31 (3.1%) 0.3% ( -5% - 6%) 0.756
LowSloppyPhrase 154.42 (2.9%) 154.91 (2.9%) 0.3% ( -5% - 6%) 0.731
MedPhrase 143.27 (2.9%) 143.88 (2.5%) 0.4% ( -4% - 6%) 0.625
OrHighNotMed 760.65 (6.8%) 763.93 (5.4%) 0.4% ( -10% - 13%) 0.824
Respell 86.71 (1.5%) 87.11 (2.1%) 0.5% ( -3% - 4%) 0.425
MedSpanNear 210.43 (2.2%) 211.43 (1.4%) 0.5% ( -3% - 4%) 0.414
MedSloppyPhrase 220.29 (2.6%) 221.35 (2.2%) 0.5% ( -4% - 5%) 0.528
BrowseDateTaxoFacets 11.24 (3.0%) 11.30 (3.1%) 0.6% ( -5% - 6%) 0.529
LowPhrase 174.98 (2.4%) 176.05 (2.0%) 0.6% ( -3% - 5%) 0.385
HighSloppyPhrase 100.25 (3.2%) 100.88 (3.0%) 0.6% ( -5% - 7%) 0.524
OrHighNotLow 1016.27 (7.7%) 1025.44 (6.9%) 0.9% ( -12% - 16%) 0.696
LowTerm 1634.20 (3.8%) 1649.20 (2.6%) 0.9% ( -5% - 7%) 0.376
HighPhrase 415.55 (3.0%) 419.76 (2.9%) 1.0% ( -4% - 7%) 0.282
OrNotHighLow 940.18 (5.4%) 952.61 (2.7%) 1.3% ( -6% - 9%) 0.328
HighTerm 1163.01 (3.8%) 1178.47 (4.8%) 1.3% ( -6% - 10%) 0.329
AndHighMed 365.15 (4.4%) 370.53 (3.1%) 1.5% ( -5% - 9%) 0.225
OrHighHigh 80.20 (2.2%) 718.16 (158.9%) 795.4% ( 620% - 978%) 0.000
WARNING: cat=OrHighHigh: hit counts differ: 19022+ vs 1002+
WARNING: cat=OrHighMed: hit counts differ: 4321+ vs 4289+
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighLow 667.55 (4.3%) 649.15 (5.9%) -2.8% ( -12% - 7%) 0.092
OrHighNotHigh 866.11 (5.2%) 851.54 (5.2%) -1.7% ( -11% - 9%) 0.305
HighTermDayOfYearSort 293.23 (16.4%) 288.53 (15.0%) -1.6% ( -28% - 35%) 0.747
Fuzzy1 58.36 (10.3%) 57.55 (10.2%) -1.4% ( -19% - 21%) 0.666
OrHighNotLow 709.78 (3.0%) 702.08 (3.9%) -1.1% ( -7% - 5%) 0.322
LowTerm 1816.04 (4.4%) 1797.54 (4.8%) -1.0% ( -9% - 8%) 0.488
Fuzzy2 50.30 (11.0%) 49.85 (11.7%) -0.9% ( -21% - 24%) 0.802
OrHighHigh 55.81 (2.4%) 55.35 (2.4%) -0.8% ( -5% - 4%) 0.288
HighSpanNear 15.16 (2.9%) 15.07 (3.0%) -0.6% ( -6% - 5%) 0.547
MedSpanNear 67.82 (3.1%) 67.47 (3.5%) -0.5% ( -6% - 6%) 0.613
OrHighMed 195.58 (2.7%) 194.60 (2.6%) -0.5% ( -5% - 4%) 0.548
LowSpanNear 36.88 (2.7%) 36.75 (2.9%) -0.3% ( -5% - 5%) 0.690
BrowseMonthTaxoFacets 13.05 (3.3%) 13.01 (3.5%) -0.3% ( -6% - 6%) 0.749
HighIntervalsOrdered 44.33 (1.3%) 44.18 (1.4%) -0.3% ( -3% - 2%) 0.439
HighSloppyPhrase 17.92 (4.4%) 17.87 (4.3%) -0.3% ( -8% - 8%) 0.821
MedPhrase 78.25 (2.5%) 78.12 (2.1%) -0.2% ( -4% - 4%) 0.823
BrowseDayOfYearSSDVFacets 27.73 (2.4%) 27.68 (2.0%) -0.2% ( -4% - 4%) 0.817
PKLookup 213.40 (2.9%) 213.09 (2.4%) -0.1% ( -5% - 5%) 0.862
AndHighHigh 100.38 (2.9%) 100.25 (2.9%) -0.1% ( -5% - 5%) 0.891
BrowseDayOfYearTaxoFacets 10.79 (3.4%) 10.78 (3.7%) -0.1% ( -6% - 7%) 0.912
AndHighLow 778.66 (3.4%) 778.10 (3.1%) -0.1% ( -6% - 6%) 0.945
Wildcard 141.04 (2.1%) 141.00 (2.4%) -0.0% ( -4% - 4%) 0.970
BrowseDateTaxoFacets 10.77 (3.3%) 10.77 (3.6%) -0.0% ( -6% - 7%) 0.993
HighTermTitleBDVSort 222.22 (12.5%) 222.20 (11.7%) -0.0% ( -21% - 27%) 0.998
LowSloppyPhrase 34.64 (3.6%) 34.64 (3.4%) -0.0% ( -6% - 7%) 0.994
IntNRQ 143.05 (0.6%) 143.25 (0.8%) 0.1% ( -1% - 1%) 0.546
BrowseMonthSSDVFacets 30.95 (5.2%) 31.00 (5.0%) 0.2% ( -9% - 10%) 0.922
Respell 66.20 (2.2%) 66.36 (1.8%) 0.2% ( -3% - 4%) 0.719
AndHighMed 300.10 (2.5%) 300.83 (2.9%) 0.2% ( -5% - 5%) 0.775
LowPhrase 54.92 (2.6%) 55.07 (2.1%) 0.3% ( -4% - 5%) 0.701
MedTerm 1522.58 (3.9%) 1529.48 (3.8%) 0.5% ( -7% - 8%) 0.711
OrNotHighLow 965.50 (5.2%) 972.89 (2.8%) 0.8% ( -6% - 9%) 0.564
HighPhrase 163.18 (2.5%) 164.66 (2.6%) 0.9% ( -4% - 6%) 0.259
HighTerm 1400.74 (3.8%) 1417.05 (3.9%) 1.2% ( -6% - 9%) 0.335
MedSloppyPhrase 146.41 (2.6%) 148.13 (2.8%) 1.2% ( -4% - 6%) 0.171
Prefix3 355.38 (3.1%) 359.62 (3.4%) 1.2% ( -5% - 7%) 0.242
OrNotHighHigh 704.32 (4.2%) 713.26 (4.6%) 1.3% ( -7% - 10%) 0.363
TermDTSort 227.11 (13.3%) 230.00 (13.5%) 1.3% ( -22% - 32%) 0.764
OrHighNotMed 724.29 (4.2%) 736.47 (3.8%) 1.7% ( -6% - 10%) 0.183
HighTermMonthSort 134.94 (10.6%) 137.37 (10.2%) 1.8% ( -17% - 25%) 0.583
OrNotHighMed 764.10 (3.7%) 778.89 (3.4%) 1.9% ( -4% - 9%) 0.082
Adrien Grand (@jpountz) (migrated from JIRA)
I made the following changes, and actually still saw varying benchmark results across runs (randomized queries?)
Indeed the benchmark randomly picks queries in the tasks file.
Changes in benchUtil.py to not verify counts
Actually you should be able to do it without modifying the benchmarking code, by configuring your Competition object to not verify counts like this in your localrun file: comp = competition.Competition(verifyCounts=False)
When I run luceneutil, I see further errors from verifyScores section of code, which may indicate bugs in my changes:
Indeed this indicates that the query returns different top hits with your change. If the change were on the order of one ulp, it could be due to the fact that the sum may depend on the order in which the clauses' scores are summed up, but given the significant score difference, there must be a bigger problem. Have you run tests with this change? That could help figure out where the bug is.
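The ulp point is worth illustrating: floating-point addition is not associative, so summing the same clause scores in a different order can legitimately change the result by a few ulps. A standalone example:

```java
// Demonstrates that float addition is order-dependent: reordering a sum
// can change the result, which is why tiny score differences between two
// disjunction scorers are expected even when both are correct.
public class FloatSumOrder {
  public static void main(String[] args) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float leftToRight = (a + b) + c; // a + b cancels exactly, then c survives: 1.0
    float rightToLeft = a + (b + c); // c is absorbed into b (ulp at 1e8 is 8), then cancels: 0.0
    System.out.println(leftToRight == rightToLeft); // false
  }
}
```

Differences of this magnitude are an extreme case; score mismatches far larger than an ulp or two, as seen above, point to a real bug rather than summation order.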
Zach Chen (@zacharymorn) (migrated from JIRA)
Actually you should be able to do it without modifying the benchmarking code, by configuring your Competition object to not verify counts like this in your localrun file: comp = competition.Competition(verifyCounts=False)
Ah I see. Thanks for the tip, will use that going forward!
Indeed this indicates that the query returns different top hits with your change. If the change was in the order of one ulp, then this could be due to the fact that the sum might depend on the order in which clauses' scores are summed up, but given the significant score difference, there must be a bigger problem. Have you run tests with this change? This could help figure out where the bug is.
Yes, ./gradlew check was passing before, but I saw your comment in the PR and that calculation was indeed incorrect. Let me correct it and try again.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I took a stab at implementing BMM and published a new PR here for further discussion: https://github.com/apache/lucene/pull/101 . I'm pretty happy about being able to implement a new scorer, even though its performance is a bit poor (although it seems to be on par with the experimental results published in http://engineering.nyu.edu/~suel/papers/bmm.pdf for the BMM and BMW comparison on 2-clause OR queries). Shall we consider adding a benchmark query set with 5+ clauses to see the performance comparison, as that seems to be when BMM may outperform BMW, as the paper suggested?
Adrien Grand (@jpountz) (migrated from JIRA)
I'd be interested in seeing what the results look like with 5+ clauses indeed.
And that also makes me curious about whether we could do better with a BulkScorer. Currently your PR needs to put a lot of code in doAdvance to reason about max scores, check whether we need to move scorers from the list of essential scorers to the list of optional scorers, etc. I think this code has non-negligible overhead since it's called on every match. A BulkScorer would make it easier to only do this sort of thing on periodic checkpoints. This is the trick that BooleanScorer uses: in order to not have to keep a heap constantly ordered, it scores windows of 2048 documents at a time.
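The windowing trick can be sketched as follows. This is a hypothetical skeleton (made-up names, not BooleanScorer's actual code) showing where the once-per-window bookkeeping would go, so that expensive decisions happen per window rather than per match:

```java
// Skeleton of window-based bulk scoring in the style BooleanScorer uses
// (hypothetical sketch, not actual Lucene code).
public class WindowedScoringSketch {
  static final int WINDOW_SIZE = 2048;

  public interface DocScorer {
    /** Collect matches in [windowMin, windowMax); returns number of docs scored. */
    int scoreWindow(int windowMin, int windowMax);
  }

  /**
   * Scores the doc-id range [min, max) one fixed-size window at a time.
   * Per-window setup, such as re-partitioning scorers into essential and
   * non-essential lists based on block max scores, would go at the top of
   * the loop instead of inside the per-document hot path.
   */
  public static int scoreRange(DocScorer scorer, int min, int max) {
    int scored = 0;
    for (int windowMin = min; windowMin < max; windowMin += WINDOW_SIZE) {
      int windowMax = Math.min(windowMin + WINDOW_SIZE, max);
      // once-per-window bookkeeping happens here, not once per match
      scored += scorer.scoreWindow(windowMin, windowMax);
    }
    return scored;
  }

  public static void main(String[] args) {
    // Trivial scorer that "matches" every doc, just to show the window arithmetic.
    int total = scoreRange((lo, hi) -> hi - lo, 0, 5000);
    System.out.println(total); // 5000: three windows of 2048, 2048 and 904 docs
  }
}
```

The trade-off is staleness: decisions made at a window boundary (e.g. which clauses are essential) may be slightly suboptimal inside the window, in exchange for removing that work from the per-document loop.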
Zach Chen (@zacharymorn) (migrated from JIRA)
Makes sense. I guess the general strategy then would be to implement BMM in the BulkScorer, and do the maxScore initialization and essential / non-essential list partitioning once per window, valid only within that 2048-document boundary. I'll give that a try!
Zach Chen (@zacharymorn) (migrated from JIRA)
I've implemented the above strategy and opened a new PR for it: https://github.com/apache/lucene/pull/113. I was using a BulkScorer on top of a collection of Scorers though, instead of a BulkScorer on top of a collection of BulkScorers like BooleanScorer, and hope the difference is due to the algorithms rather than me misunderstanding the intended usage of the BulkScorer interface :D . The result from the benchmark util still shows it's slower than WANDScorer for 2-clause queries, especially for the OrHighHigh task.
During the implementation of this BulkScorer I also realized there were some issues with the other PR I published earlier, so I'll fix them next and see if that will give us better result.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I've done another pass and fixed a few issues in https://github.com/apache/lucene/pull/101. I tried some other optimizations as well (such as moving a scorer from the essential to the non-essential list every time minCompetitiveScore gets updated), but they didn't seem to improve the benchmark results much for pure disjunction queries in either implementation. Assuming there's no major miss or bug in the two implementations so far, I also feel that, compared with BMW, the main bottleneck in BMM for the 2-clause OR queries run by the benchmark is indeed the additional frequent work performed to check and align on the max score boundary.
What do you think? Do you have any suggestion where I should look next?
Zach Chen (@zacharymorn) (migrated from JIRA)
I was trying to modify the CreateQueries class in luceneutil to generate OR queries with 5 clauses, but ran into some issues running it. So I did a quick hack to combine the queries from OrHighHigh, OrHighMed and OrHighLow to create a new OrHighHighMedHighLow task. I've attached the resulting file wikimedium.10M.nostopwords.tasks to this ticket.
Here are the luceneutil results from 2 runs for each implementation:
Scorer https://github.com/apache/lucene/pull/101
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 30.97 (6.2%) 24.92 (4.4%) -19.5% ( -28% - -9%) 0.000
PKLookup 223.53 (2.4%) 228.10 (3.7%) 2.0% ( -3% - 8%) 0.037
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 32.83 (3.4%) 34.00 (5.1%) 3.6% ( -4% - 12%) 0.009
PKLookup 217.86 (2.8%) 228.14 (4.2%) 4.7% ( -2% - 12%) 0.000
BulkScorer https://github.com/apache/lucene/pull/113
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 197.84 (4.1%) 207.79 (4.2%) 5.0% ( -3% - 13%) 0.000
OrHighHighMedHighLow 32.50 (16.7%) 35.79 (9.9%) 10.1% ( -14% - 44%) 0.020
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHighMedHighLow 28.61 (5.4%) 22.28 (4.2%) -22.1% ( -30% - -13%) 0.000
PKLookup 227.38 (2.6%) 233.05 (2.7%) 2.5% ( -2% - 8%) 0.003
Adrien Grand (@jpountz) (migrated from JIRA)
Thanks for writing two scorers to test this out! Would you be able to run queries under a profiler to see where your new scorers are spending most time? This might help identify how we could make them faster.
Also thanks for testing with more queries; FWIW it would be good enough to add only 4-5 new queries to the tasks file to play with the change. By the way, I'd be curious to see how your new scorers perform with 5 "Med" terms, which should be a worst-case scenario for BMW since all terms should have similar max scores. Since the queries you ran have a "Low" term, I wonder whether that term drives iteration, which would prevent BMM from showing the lower overhead it has compared to BMW.
Zach Chen (@zacharymorn) (migrated from JIRA)
No problem! Writing these scorers has actually been a great exercise for me to understand more about the scoring-related APIs and benchmark testing. I have enjoyed it a lot!
For the profiling, are you referring to JFR? It is currently enabled by default in luceneutil, and I've added the results below from the 5 "Med" term queries (queries file wikimedium.10M.nostopwords.tasks.5OrMeds attached):
BMM Scorer Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 40.66 (9.7%) 32.77 (7.7%) -19.4% ( -33% - -2%) 0.000
PKLookup 215.12 (1.5%) 221.56 (1.8%) 3.0% ( 0% - 6%) 0.000
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 12153 events (total: 12153)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.24% 515 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()
4.22% 513 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#updateUpToAndMaxScore()
3.11% 378 java.util.LinkedList#listIterator()
2.53% 307 java.util.LinkedList$ListItr#next()
2.42% 294 java.util.zip.Inflater#inflateBytesBytes()
2.15% 261 org.apache.lucene.search.DisiPriorityQueue#pop()
1.60% 195 jdk.internal.misc.Unsafe#getByte()
1.47% 179 org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()
1.41% 171 java.util.AbstractList$SubList#listIterator()
1.36% 165 java.util.AbstractList#listIterator()
1.31% 159 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.22% 148 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.21% 147 org.apache.lucene.search.DisiPriorityQueue#add()
1.20% 146 java.util.LinkedList$ListItr#<init>()
1.15% 140 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.11% 135 java.util.LinkedList$ListItr#checkForComodification()
1.08% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.04% 126 java.nio.DirectByteBuffer#get()
1.00% 122 java.lang.Object#wait()
1.00% 122 org.apache.lucene.search.DisiPriorityQueue#downHeap()
0.95% 116 java.util.AbstractList$Itr#<init>()
0.82% 100 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.81% 98 org.apache.lucene.store.ByteBufferGuard#getByte()
0.80% 97 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.73% 89 sun.nio.fs.UnixNativeDispatcher#open0()
0.73% 89 java.lang.ClassLoader#defineClass1()
0.72% 87 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.69% 84 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.63% 77 jdk.internal.util.ArraysSupport#mismatch()
0.63% 76 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#repartitionLists()
CPU merged search profile for baseline:
PROFILE SUMMARY from 9671 events (total: 9671)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.96% 286 java.util.zip.Inflater#inflateBytesBytes()
1.78% 172 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.73% 167 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.72% 166 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.57% 152 java.lang.Object#wait()
1.57% 152 org.apache.lucene.search.DisiPriorityQueue#add()
1.51% 146 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.43% 138 java.nio.DirectByteBuffer#get()
1.35% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.15% 111 java.io.RandomAccessFile#readBytes()
1.10% 106 java.lang.ClassLoader#defineClass1()
1.07% 103 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.01% 98 org.apache.lucene.store.ByteBufferGuard#getByte()
1.01% 98 org.apache.lucene.search.WANDScorer#addTail()
1.00% 97 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.00% 97 java.io.UnixFileSystem#getBooleanAttributes0()
0.99% 96 sun.nio.fs.UnixNativeDispatcher#open0()
0.93% 90 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.90% 87 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.90% 87 java.util.Arrays#rangeCheck()
0.88% 85 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.86% 83 org.apache.lucene.search.WANDScorer#advanceTail()
0.84% 81 jdk.internal.util.ArraysSupport#mismatch()
0.80% 77 org.apache.lucene.search.WANDScorer#popTail()
0.78% 75 jdk.internal.misc.Unsafe#getByte()
0.74% 72 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.71% 69 org.apache.lucene.search.DisiPriorityQueue#pop()
0.71% 69 org.apache.lucene.search.WANDScorer$2#matches()
0.65% 63 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#scanToTermLeaf()
0.64% 62 org.apache.lucene.search.DisiPriorityQueue#top()
—
BMM Scorer Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 36.68 (10.0%) 22.26 (10.2%) -39.3% ( -54% - -21%) 0.000
PKLookup 217.96 (9.2%) 221.95 (9.5%) 1.8% ( -15% - 22%) 0.537
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 15355 events (total: 15355)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
4.04% 621 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()
3.30% 507 java.util.AbstractList#subList()
3.15% 484 java.util.LinkedList#listIterator()
2.40% 369 java.util.AbstractList$SubList#listIterator()
2.38% 365 org.apache.lucene.search.DisiPriorityQueue#upHeap()
2.29% 351 java.util.zip.Inflater#inflateBytesBytes()
2.10% 323 java.util.Arrays#asList()
1.95% 300 java.util.AbstractList#listIterator()
1.83% 281 jdk.internal.misc.Unsafe#getByte()
1.75% 269 org.apache.lucene.search.WANDScorer#scaleMaxScore()
1.53% 235 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.45% 222 java.util.LinkedList$ListItr#next()
1.39% 213 org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()
1.35% 208 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.32% 202 java.util.AbstractList$Itr#<init>()
1.15% 176 org.apache.lucene.search.DisiPriorityQueue#pop()
1.13% 174 java.nio.DirectByteBuffer#get()
1.04% 159 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.03% 158 org.apache.lucene.search.LeafSimScorer#getNormValue()
1.03% 158 java.util.AbstractList$SubList$1#nextIndex()
0.96% 148 org.apache.lucene.search.DisiPriorityQueue#top()
0.96% 147 java.util.AbstractList$Itr#next()
0.94% 145 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
0.83% 127 java.lang.Object#wait()
0.79% 122 org.apache.lucene.store.ByteBufferGuard#getByte()
0.69% 106 org.apache.lucene.store.DataInput#readVInt()
0.66% 102 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.64% 98 java.lang.Math#scalb()
0.64% 98 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.63% 96 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
CPU merged search profile for baseline:
PROFILE SUMMARY from 11671 events (total: 11671)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.96% 345 java.util.zip.Inflater#inflateBytesBytes()
2.56% 299 java.io.RandomAccessFile#readBytes()
2.29% 267 org.apache.lucene.search.DisiPriorityQueue#add()
2.04% 238 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
2.01% 235 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.97% 230 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.91% 223 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.66% 194 java.nio.DirectByteBuffer#get()
1.48% 173 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.31% 153 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.29% 151 org.apache.lucene.search.WANDScorer#addTail()
1.24% 145 java.lang.Object#wait()
1.23% 143 org.apache.lucene.search.DisiPriorityQueue#pop()
1.17% 137 org.apache.lucene.search.WANDScorer#popTail()
1.11% 129 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.04% 121 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
1.01% 118 org.apache.lucene.search.WANDScorer$2#matches()
0.94% 110 org.apache.lucene.search.DisiPriorityQueue#top()
0.91% 106 sun.nio.fs.UnixNativeDispatcher#open0()
0.90% 105 java.io.FileInputStream#readBytes()
0.90% 105 java.lang.ClassLoader#defineClass1()
0.90% 105 org.apache.lucene.store.DataInput#readVInt()
0.90% 105 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.83% 97 org.apache.lucene.store.ByteBufferGuard#getByte()
0.80% 93 org.apache.lucene.search.WANDScorer#advanceTail()
0.75% 88 jdk.internal.misc.Unsafe#getByte()
0.72% 84 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.68% 79 perf.PKLookupTask#go()
0.68% 79 java.io.UnixFileSystem#getBooleanAttributes0()
0.65% 76 org.apache.lucene.search.WANDScorer#upHeapMaxScore()
----
BMM BulkScorer Run 1
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 34.58 (16.9%) 28.75 (9.0%) -16.9% ( -36% - 10%) 0.000
PKLookup 223.50 (2.6%) 229.67 (2.6%) 2.8% ( -2% - 8%) 0.001
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 12720 events (total: 12720)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.18% 405 java.util.LinkedList#listIterator()
2.62% 333 java.util.zip.Inflater#inflateBytesBytes()
2.43% 309 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$1#doAdvance()
2.21% 281 java.util.AbstractList#subList()
2.16% 275 org.apache.lucene.search.WANDScorer#scaleMaxScore()
1.96% 249 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$2#matches()
1.94% 247 jdk.internal.misc.Unsafe#getByte()
1.71% 217 java.util.Arrays#asList()
1.68% 214 java.util.AbstractList#listIterator()
1.66% 211 java.util.AbstractList$SubList#listIterator()
1.63% 207 java.util.LinkedList$ListItr#next()
1.37% 174 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.26% 160 java.lang.Object#wait()
1.16% 148 org.apache.lucene.search.LeafSimScorer#getNormValue()
1.12% 143 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.04% 132 java.nio.DirectByteBuffer#get()
1.03% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.03% 131 java.util.AbstractList$Itr#<init>()
0.96% 122 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
0.91% 116 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer#updateIntervalBoundary()
0.90% 114 jdk.internal.util.ArraysSupport#mismatch()
0.89% 113 org.apache.lucene.search.DisiPriorityQueue#pop()
0.88% 112 java.util.AbstractList$SubList$1#nextIndex()
0.83% 105 org.apache.lucene.search.DisiPriorityQueue#downHeap()
0.74% 94 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.73% 93 org.apache.lucene.store.ByteBufferGuard#getByte()
0.73% 93 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.72% 91 sun.nio.fs.UnixNativeDispatcher#open0()
0.72% 91 org.apache.lucene.search.DisiPriorityQueue#add()
0.68% 86 org.apache.lucene.search.DisiPriorityQueue#top()
CPU merged search profile for baseline:
PROFILE SUMMARY from 9996 events (total: 9996)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.16% 316 java.util.zip.Inflater#inflateBytesBytes()
1.89% 189 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.74% 174 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.73% 173 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.65% 165 org.apache.lucene.search.DisiPriorityQueue#add()
1.63% 163 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.62% 162 java.io.RandomAccessFile#readBytes()
1.46% 146 java.lang.Object#wait()
1.41% 141 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.21% 121 org.apache.lucene.search.WANDScorer#addTail()
1.16% 116 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.11% 111 jdk.internal.util.ArraysSupport#mismatch()
1.04% 104 java.nio.DirectByteBuffer#get()
1.02% 102 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
1.00% 100 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.98% 98 org.apache.lucene.search.WANDScorer$2#matches()
0.94% 94 org.apache.lucene.search.DisiPriorityQueue#pop()
0.94% 94 org.apache.lucene.search.WANDScorer#popTail()
0.91% 91 org.apache.lucene.search.DisiPriorityQueue#top()
0.86% 86 org.apache.lucene.util.BytesRef#compareTo()
0.85% 85 org.apache.lucene.store.ByteBufferGuard#getByte()
0.82% 82 java.io.UnixFileSystem#getBooleanAttributes0()
0.79% 79 sun.nio.fs.UnixNativeDispatcher#open0()
0.78% 78 jdk.internal.misc.Unsafe#getByte()
0.77% 77 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.73% 73 java.lang.ClassLoader#defineClass1()
0.73% 73 org.apache.lucene.search.WANDScorer#advanceTail()
0.70% 70 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
0.69% 69 java.lang.invoke.InvokerBytecodeGenerator#emitStaticInvoke()
0.67% 67 perf.PKLookupTask#go()
—
BMM BulkScorer Run 2
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrMedMedMedMedMed 38.91 (10.4%) 22.19 (4.1%) -43.0% ( -52% - -31%) 0.000
PKLookup 220.07 (5.4%) 226.90 (3.1%) 3.1% ( -5% - 12%) 0.026
CPU merged search profile for my_modified_version:
PROFILE SUMMARY from 15143 events (total: 15143)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
3.94% 596 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$1#doAdvance()
3.52% 533 java.util.AbstractList#subList()
3.20% 485 java.util.LinkedList#listIterator()
2.81% 426 java.util.AbstractList$SubList#listIterator()
2.72% 412 org.apache.lucene.search.WANDScorer#scaleMaxScore()
2.44% 369 java.util.AbstractList#listIterator()
2.37% 359 org.apache.lucene.search.DisiPriorityQueue#upHeap()
2.27% 343 java.util.Arrays#asList()
2.15% 326 java.util.zip.Inflater#inflateBytesBytes()
1.71% 259 jdk.internal.misc.Unsafe#getByte()
1.44% 218 java.util.AbstractList$Itr#<init>()
1.40% 212 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.35% 205 java.util.LinkedList$ListItr#next()
1.33% 202 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer$2#matches()
1.31% 199 org.apache.lucene.search.DisiPriorityQueue#pop()
1.20% 182 java.nio.DirectByteBuffer#get()
1.10% 167 java.util.AbstractList$SubList$1#nextIndex()
1.07% 162 java.util.AbstractList$Itr#next()
0.99% 150 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
0.96% 146 org.apache.lucene.search.LeafSimScorer#getNormValue()
0.94% 143 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
0.92% 139 org.apache.lucene.search.DisiPriorityQueue#top()
0.90% 137 java.lang.Object#wait()
0.79% 119 org.apache.lucene.search.BMMBulkScorer$BMMBoundaryAwareScorer#updateIntervalBoundary()
0.77% 117 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
0.71% 108 java.util.AbstractList$SubList$1#<init>()
0.71% 108 org.apache.lucene.store.ByteBufferGuard#getByte()
0.67% 101 sun.nio.fs.UnixNativeDispatcher#open0()
0.66% 100 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.61% 93 java.lang.ClassLoader#defineClass1()
CPU merged search profile for baseline:
PROFILE SUMMARY from 10813 events (total: 10813)
tests.profile.mode=cpu
tests.profile.count=30
tests.profile.stacksize=1
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
2.67% 289 org.apache.lucene.search.DisiPriorityQueue#add()
2.57% 278 java.util.zip.Inflater#inflateBytesBytes()
2.23% 241 org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.96% 212 org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.91% 207 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.65% 178 java.lang.Object#wait()
1.49% 161 java.nio.DirectByteBuffer#get()
1.36% 147 org.apache.lucene.search.DisiPriorityQueue#top()
1.34% 145 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.29% 139 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.29% 139 org.apache.lucene.search.WANDScorer#addTail()
1.28% 138 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.17% 126 org.apache.lucene.search.WANDScorer#moveToNextCandidate()
1.16% 125 org.apache.lucene.search.DisiPriorityQueue#pop()
1.15% 124 org.apache.lucene.search.WANDScorer$2#matches()
1.09% 118 org.apache.lucene.search.WANDScorer#popTail()
1.04% 112 org.apache.lucene.search.WANDScorer#advanceTail()
1.00% 108 org.apache.lucene.store.ByteBufferGuard#getByte()
0.96% 104 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.95% 103 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.92% 99 sun.nio.fs.UnixNativeDispatcher#open0()
0.89% 96 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.86% 93 java.lang.ClassLoader#defineClass1()
0.83% 90 jdk.internal.misc.Unsafe#getByte()
0.80% 86 java.util.Arrays#compareUnsigned()
0.79% 85 org.apache.lucene.search.WANDScorer#upHeapMaxScore()
0.76% 82 perf.PKLookupTask#go()
0.68% 74 org.apache.lucene.search.WANDScorer#pushBackLeads()
0.68% 74 java.io.RandomAccessFile#readBytes()
0.68% 74 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
I think what stands out from these comparisons is that the two BMM implementations spent a lot of CPU cycles checking and advancing to the next candidate doc, whereas WAND spent its time advancing at the postings-reader level. Maybe the BMM implementations did not propagate minScore correctly or effectively to the underlying scorers, so many more candidate docs were surfaced from TermScorer? Another possibility is that BMM may indeed require more operations than BMW, as suggested by the paper. I will check more and get back to this...
Zach Chen (@zacharymorn) (migrated from JIRA)
Just want to provide a quick summary of the latest progress of this issue. Currently there are 3 different BMM implementations from 2 PRs:
#1
@jpountz what do you think about the above results and the latest changes, and are there any other ideas we should try? From the current results it appears option 1 might be the one to go with. I can start productizing the changes and adding tests once we have settled on the implementation approach.
Adrien Grand (@jpountz) (migrated from JIRA)
For some reason it hurt Fuzzy1 & Fuzzy2 performance consistently by around 8%-13%, even though it wasn't used for those queries
Are you sure? I believe that fuzzy queries rewrite to boolean queries, so they would use your new block-max maxscore under the hood?
Zach Chen (@zacharymorn) (migrated from JIRA)
Are you sure? I believe that fuzzy queries rewrite to boolean queries, so they would use your new block-max maxscore under the hood?
Hmm, I verified that by throwing a runtime exception in the BMM BulkScorer's constructor and running only the Fuzzy1 & Fuzzy2 queries in the benchmark, which completed successfully. I suspect the slowdown may come from the checks to see whether BMM is applicable. Let me take a further look there.
Zach Chen (@zacharymorn) (migrated from JIRA)
I see why Fuzzy1 & Fuzzy2 did not trigger the BMM scorer / bulkScorer now. Those queries were rewritten into boolean queries with boosting (BoostQuery), but in the BMM eligibility check I had checked for TermQuery directly https://github.com/apache/lucene/pull/113/files#diff-d500c30048128831b0fe3c53d9bb74eed7d8063e81d33737b26dcd00bc7f1fd2R337 , hence the BMM scorer / bulkScorer were not invoked for them.
It's also likely that the looping in that check hurt performance for both implementations, as fuzzy queries can expand into queries with many subqueries (one instance I saw had 50), and the current logic goes through all of them.
Zach Chen (@zacharymorn) (migrated from JIRA)
I made some changes to the BulkScorer implementations to return false for BMM eligibility immediately when a non-term query was identified, and they improved the benchmark results for Fuzzy1 & Fuzzy2 a bit (https://github.com/apache/lucene/pull/113/commits/f4115f78be0833b65694ad6a0f9f4f32565091e7). However, it appears that Fuzzy1 & Fuzzy2 benchmark results vary more across runs and query sets than other tasks in general.
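The early-exit check described above can be sketched as follows. This is hypothetical standalone code (the Clause type and eligible method are stand-ins, not Lucene's BooleanWeight logic); the point is bailing out on the first disqualifying clause rather than scanning all of a fuzzy rewrite's ~50 subqueries:

```java
import java.util.List;

// Hypothetical sketch of an early-exit eligibility check: return false as soon as
// a disqualifying (non-term) clause is seen, instead of inspecting every clause.
class BmmEligibility {
    // Stand-in for a boolean clause; real code would inspect the query type.
    record Clause(boolean isTermQuery) {}

    static boolean eligible(List<Clause> clauses) {
        for (Clause c : clauses) {
            if (!c.isTermQuery()) {
                return false; // early exit: avoids scanning large fuzzy rewrites
            }
        }
        return clauses.size() >= 2; // BMM only makes sense for disjunctions
    }
}
```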
Zach Chen (@zacharymorn) (migrated from JIRA)
@jpountz what do you think about the results we got so far? If we are good with the trade-off and the performance improvement BMM has for OrHighHigh and OrHighMed queries, I can work on productizing the changes next.
Adrien Grand (@jpountz) (migrated from JIRA)
The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster?
I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers:
BMW
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
BMM - scorer
AVG: 3.0175025635E7
Median: 18845216
P75: 37770072
P90: 77785996
P95: 98711520
P99: 181530180
Collected AVG: 50089.09
Collected Median: 20322
Collected P75: 57955
Collected P90: 125052
Collected P95: 194696
Collected P99: 442811
BMM - bulk scorer
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM?
EDIT: I initially got wrong results for the BMM scorer because I had run with an old commit of the PR; this is now fixed.
Zach Chen (@zacharymorn) (migrated from JIRA)
The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster?
Ah not sure why I didn't think of running them through BMM earlier! I just gave them a run, and got the following results:
BMM Scorer
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 30.46 (24.7%) 17.63 (11.6%) -42.1% ( -62% - -7%) 0.000
Fuzzy2 21.61 (16.4%) 16.28 (12.0%) -24.7% ( -45% - 4%) 0.000
PKLookup 216.72 (4.1%) 215.63 (3.0%) -0.5% ( -7% - 6%) 0.654
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 30.58 (9.1%) 22.12 (6.4%) -27.7% ( -39% - -13%) 0.000
Fuzzy2 36.07 (12.7%) 27.05 (10.8%) -25.0% ( -42% - -1%) 0.000
PKLookup 215.26 (3.4%) 213.99 (2.5%) -0.6% ( -6% - 5%) 0.530
BMMBulkScorer without window (with the above scorer implementation)
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 16.32 (22.6%) 15.68 (16.3%) -3.9% ( -34% - 45%) 0.527
Fuzzy1 48.11 (17.6%) 47.48 (13.6%) -1.3% ( -27% - 36%) 0.791
PKLookup 213.67 (3.2%) 212.52 (4.0%) -0.5% ( -7% - 6%) 0.640
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 26.99 (23.2%) 24.75 (13.6%) -8.3% ( -36% - 37%) 0.169
PKLookup 216.27 (4.3%) 216.43 (3.4%) 0.1% ( -7% - 8%) 0.951
Fuzzy1 19.01 (24.2%) 20.01 (14.2%) 5.3% ( -26% - 57%) 0.400
BMMBulkScorer with window size 1024
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 23.56 (26.0%) 19.08 (13.9%) -19.0% ( -46% - 28%) 0.004
Fuzzy1 30.97 (31.6%) 25.82 (16.9%) -16.6% ( -49% - 46%) 0.038
PKLookup 213.23 (2.5%) 211.63 (1.8%) -0.7% ( -5% - 3%) 0.289
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 20.59 (12.1%) 20.59 (10.5%) -0.0% ( -20% - 25%) 0.994
PKLookup 205.21 (3.1%) 206.99 (3.7%) 0.9% ( -5% - 7%) 0.422
Fuzzy2 30.74 (22.7%) 32.71 (17.0%) 6.4% ( -27% - 59%) 0.311
These results actually look strange to me, as I would expect the windowless BulkScorer to perform similarly to the scorer, since it just uses the scorer implementation under the hood. I'll need to dig in more to understand what contributed to these differences (their JFR CPU recordings look similar too).
From the results I have now, it seems BMM may not be ideal for handling queries with many terms. My high-level guess is that with these queries, which can be rewritten into boolean queries with ~50 terms, BMM may spend a lot of time computing upTo and updating maxScore, since the minimum of all scorers' block boundaries is used to update upTo each time. This can explain why the bulkScorer implementation with a fixed window size performs better than the scorer one, but doesn't explain the difference above.
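Under that guess, the boundary bookkeeping could look roughly like this (a simplified, hypothetical sketch, not Lucene code): taking the minimum of all clauses' current block boundaries means that with more clauses the intervals get shorter, so the upTo and max-score updates run more often:

```java
// Hypothetical sketch: upTo is the minimum of each clause's current impact-block
// boundary. With ~50 clauses the minimum tends to be close, so intervals shrink
// and the per-interval max-score bookkeeping runs frequently.
class BoundaryUpdate {
    /** blockBoundaries[i] = last docID of scorer i's current impact block. */
    static int nextUpTo(int[] blockBoundaries) {
        int upTo = Integer.MAX_VALUE;
        for (int b : blockBoundaries) {
            upTo = Math.min(upTo, b); // smallest boundary bounds the interval
        }
        return upTo;
    }
}
```

A fixed-window bulk scorer sidesteps this by amortizing the update over a constant-size window instead of every block boundary.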
I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers:
Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM?
Thanks for running these additional tests! The results indeed look interesting. I took a look at the MSMarcoPassages.java code you attached, and wonder if it's also possible that, since the percentile numbers were computed after sorting, BMM does much better at some low percentiles (P10, for example) but worse than BMW for the rest (at least 50% of them)?
I also notice that the BMM BulkScorer collects roughly 10X as many docs as the BMM scorer, which in turn collects more than 10X as many docs as BMW. I feel this may also explain the unexpected slowdown? In general I would assume these scorers all collect the same number of top docs.
Also, I'm interested in running these benchmark tests myself. Are the passages dataset and queries you used available for download somewhere (I found the MS GitHub site, but I'm not sure it has the same version as the one you used)?
Adrien Grand (@jpountz) (migrated from JIRA)
I also notice that the BMM BulkScorer collects roughly 10X as many docs as the BMM scorer, which in turn collects more than 10X as many docs as BMW. I feel this may also explain the unexpected slowdown? In general I would assume these scorers all collect the same number of top docs.
Actually this matches my expectation. BMM and BMW differ in that BMM only makes a decision about which scorers lead iteration once per block, while BMW needs to make decisions on every document. So BMM collects more documents than BMW but BMW takes the risk that trying to be too smart makes things slower than a simpler approach.
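A back-of-the-envelope way to see this difference (illustrative code only, not Lucene's implementation): if BMW makes a lead/tail decision per candidate document while BMM makes one per block boundary, the decision counts diverge by roughly the block size:

```java
// Illustrative only: BMW re-decides which scorers lead on every candidate document,
// while BMM fixes the essential/non-essential split once per block. BMM therefore
// evaluates (and collects) more documents, but pays far less per document.
class DecisionCounts {
    static long bmwDecisions(long docsEvaluated) {
        return docsEvaluated; // one decision per candidate document
    }

    static long bmmDecisions(long docsEvaluated, int blockSize) {
        return (docsEvaluated + blockSize - 1) / blockSize; // one per block
    }
}
```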
Are these passages data set and queries used available for download somewhere
Yes. You can download the "Collection" and "Queries" files from https://microsoft.github.io/msmarco/#ranking (make sure to accept terms at the top first so that download links are active).
Zach Chen (@zacharymorn) (migrated from JIRA)
Actually this matches my expectation. BMM and BMW differ in that BMM only makes a decision about which scorers lead iteration once per block, while BMW needs to make decisions on every document. So BMM collects more documents than BMW but BMW takes the risk that trying to be too smart makes things slower than a simpler approach.
Ok I also took a further look at the TopDocsCollector code, and confirmed that I had an incorrect understanding of "collect" and "hit count" here earlier. This (and Michael's earlier response) totally makes sense now!
Yes. You can download the "Collection" and "Queries" files from https://microsoft.github.io/msmarco/#ranking (make sure to accept terms at the top first so that download links are active).
Thanks! I was able to download them. Will explore a bit more to see how they can be improved further.
Zach Chen (@zacharymorn) (migrated from JIRA)
Hi @jpountz, I've tried out a few ideas in the last few days and they gave some improvements (though they also made OrMedMedMedMedMed worse). However, the result was still not as performant as BMW on the MSMARCO passages dataset. The ideas I tried include:
Collectively, these gave a 70%-90% performance boost to OrHighHigh, 60%-150% for OrHighMed, and smaller improvements for AndHighOrMedMed, but at the expense of OrMedMedMedMedMed performance (down about 40% with the #4 changes).
For the MSMARCO passages dataset, they now give the following results (modified slightly from your version to show more percentiles, and to add commas as digit separators for readability):
BMW Scorer
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894
BMM Scorer
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
I've also attached "JFR result for BMM scorer with optimizations May 22" for the BMM scorer profiling result from the latest changes. Overall, it seems the larger number of docs collected by BMM is becoming the performance bottleneck, as around 50% of the computation was spent in SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing scores for candidate docs (around 34% was spent finding the next doc in BlockMaxMaxscoreScorer#nextDoc). If there's a way to prune more docs faster, it should improve BMM further.
Implemented.
Lucene often gets benchmarked against other engines, e.g. against Tantivy and PISA at https://tantivy-search.github.io/bench/ or against research prototypes in Table 1 of https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf. Given that top-level disjunctions of term queries are commonly used for benchmarking, it would be nice to optimize this case a bit more. I suspect that we could make fewer per-document decisions by implementing a BulkScorer instead of a Scorer.
Migrated from LUCENE-9335 by Adrien Grand (@jpountz), updated May 23 2021 Attachments: JFR result for BMM scorer with optimizations May 22.png, MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, wikimedium.10M.nostopwords.tasks.5OrMeds Pull requests: https://github.com/apache/lucene/pull/81, https://github.com/apache/lucene/pull/101, https://github.com/apache/lucene/pull/113