apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Faster search APIs for doc values [LUCENE-7462] #8514

Closed asfimport closed 7 years ago

asfimport commented 8 years ago

While the iterator API helps deal with sparse doc values more efficiently, it also makes search-time operations more costly. For instance, the old random-access API made it possible to compute facets on a given segment without any conditionals, by just incrementing the counter at index ordinal+1, while the new API requires advancing the iterator if necessary and then checking whether it is positioned exactly on the right document.
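To make the cost difference concrete, here is a minimal sketch of the two counting loops. The classes `RandomAccessOrds` and `IteratorOrds` are hypothetical stand-ins written for this illustration, not the actual Lucene classes, and the data in `main` is made up.

```java
import java.util.Arrays;

/** Hypothetical stand-ins for the two API styles; these are not Lucene classes. */
final class FacetCountSketch {

  /** Random-access style: returns the ordinal for a doc, or -1 if the doc has no value. */
  interface RandomAccessOrds {
    int getOrd(int docID);
  }

  /** Iterator style: only docs that have a value are visited, in increasing order. */
  static final class IteratorOrds {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs; // sorted doc ids that have a value
    private final int[] ords; // ordinal for each entry of docs
    private int idx = -1;

    IteratorOrds(int[] docs, int[] ords) {
      this.docs = docs;
      this.ords = ords;
    }

    int docID() {
      return idx < 0 ? -1 : (idx >= docs.length ? NO_MORE_DOCS : docs[idx]);
    }

    /** Moves forward to the first doc >= target that has a value. */
    int advance(int target) {
      do {
        idx++;
      } while (idx < docs.length && docs[idx] < target);
      return docID();
    }

    int ordValue() {
      return ords[idx];
    }
  }

  /** Old style: no conditionals, slot 0 collects docs without a value (ord == -1). */
  static int[] countRandomAccess(RandomAccessOrds dv, int[] hits, int valueCount) {
    int[] counts = new int[valueCount + 1];
    for (int doc : hits) {
      counts[dv.getOrd(doc) + 1]++;
    }
    return counts;
  }

  /** New style: position the iterator if needed, then check where it landed. */
  static int[] countIterator(IteratorOrds dv, int[] hits, int valueCount) {
    int[] counts = new int[valueCount + 1];
    for (int doc : hits) {
      if (dv.docID() < doc) {
        dv.advance(doc);
      }
      if (dv.docID() == doc) {
        counts[dv.ordValue() + 1]++;
      } else {
        counts[0]++;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    int[] ordsByDoc = {1, 0, -1, 1, -1, 2}; // made-up segment with 6 docs
    int[] hits = {0, 1, 2, 3, 5};
    int[] a = countRandomAccess(d -> ordsByDoc[d], hits, 3);
    int[] b = countIterator(
        new IteratorOrds(new int[] {0, 1, 3, 5}, new int[] {1, 0, 1, 2}), hits, 3);
    System.out.println(Arrays.toString(a) + " " + Arrays.toString(b));
  }
}
```

Both loops produce the same counts; the iterator version pays for two extra branches per hit.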

Since it is very common for fields to exist across most documents, I suspect codecs will keep an internal structure that is similar to the current codec in the dense case, by having a dense representation of the data and just making the iterator skip over the minority of documents that do not have a value.

I suggest that we add APIs that make things cheaper at search time. For instance in the case of SORTED doc values, it could look like LegacySortedDocValues with the additional restriction that documents can only be consumed in order. Codecs that can implement this API efficiently would hide it behind a SortedDocValues adapter, and then at search time facets and comparators (which liked the LegacySortedDocValues API better) would either unwrap or hide the SortedDocValues they got behind a more random-access API (which would only happen in the truly sparse case if the codec optimizes the dense case).

One challenge is that we already use the same idea for hiding single-valued impls behind multi-valued impls, so we would need to enforce the order in which the wrapping needs to happen. At first sight, it seems that it would be best to do the single-value-behind-multi-value-API wrapping above the random-access-behind-iterator-API wrapping. The complexity of wrapping/unwrapping in the right order could be contained in the DocValues helper class.
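A rough sketch of the unwrap-or-wrap idea, covering only the random-access-behind-iterator layer (the single-valued/multi-valued layer is left out). Every name here (`InOrderOrds`, `OrdsIterator`, `DenseAdapter`, `SparseIterator`, `unwrapOrWrap`) is hypothetical and only illustrates the shape such a helper could take.

```java
/** Hypothetical sketch of hiding an in-order random-access API behind an iterator. */
final class DocValuesUnwrapSketch {

  /** In-order random-access view: docs must be requested in increasing order. */
  interface InOrderOrds {
    int getOrd(int docID); // -1 if the doc has no value
  }

  /** Iterator view over the docs that have a value. */
  interface OrdsIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int docID();
    int nextDoc();
    int ordValue();
  }

  /** Adapter a codec could return when it stores the field densely. */
  static final class DenseAdapter implements OrdsIterator {
    final InOrderOrds dense;
    private final int maxDoc;
    private int doc = -1;

    DenseAdapter(InOrderOrds dense, int maxDoc) {
      this.dense = dense;
      this.maxDoc = maxDoc;
    }

    public int docID() { return doc; }

    public int nextDoc() {
      if (doc == NO_MORE_DOCS) return doc;
      // Skip over the minority of docs that have no value.
      do {
        doc++;
      } while (doc < maxDoc && dense.getOrd(doc) < 0);
      if (doc >= maxDoc) doc = NO_MORE_DOCS;
      return doc;
    }

    public int ordValue() { return dense.getOrd(doc); }
  }

  /** Plain iterator for the truly sparse case. */
  static final class SparseIterator implements OrdsIterator {
    private final int[] docs;
    private final int[] ords;
    private int idx = -1;

    SparseIterator(int[] docs, int[] ords) {
      this.docs = docs;
      this.ords = ords;
    }

    public int docID() {
      return idx < 0 ? -1 : (idx < docs.length ? docs[idx] : NO_MORE_DOCS);
    }

    public int nextDoc() {
      if (idx < docs.length) idx++;
      return docID();
    }

    public int ordValue() { return ords[idx]; }
  }

  /** Helper: unwrap the dense structure if present, else wrap the iterator. */
  static InOrderOrds unwrapOrWrap(OrdsIterator it) {
    if (it instanceof DenseAdapter) {
      return ((DenseAdapter) it).dense; // dense case: direct lookups
    }
    // Sparse case: position the iterator on demand.
    return docID -> {
      int cur = it.docID();
      while (cur < docID) {
        cur = it.nextDoc();
      }
      return cur == docID ? it.ordValue() : -1;
    };
  }
}
```

With a helper like this, a consumer would always program against the in-order random-access view and pay the iterator bookkeeping only when the codec did not provide a dense structure.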

I think this change would also simplify search-time consumption of doc values, which currently needs to spend several lines of code positioning the iterator every time it needs to do something interesting with doc values.


Migrated from LUCENE-7462 by Adrien Grand (@jpountz), resolved Oct 24 2016
Attachments: LUCENE-7462.patch, LUCENE-7462-advanceExact.patch
Linked issues:

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Thanks for bringing this up. As I was reviewing the big commit for the iterator style, I kept seeing the condition over and over again like you do. I thought to myself, maybe we should at least have some convenience method, possibly overridable by the codec/impl.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm not convinced this is necessary. I think it may be a step backwards, and we should instead spend our effort making a codec whose advance impls are fast.

Wouldn't this mean we'd need 2X the search-time code, to handle the "yeah codec secretly is actually random access so I specialize that case" and "no, it's really an iterator" cases?

> I think this change would also simplify search-time consumption of doc values, which currently needs to spend several lines of code positioning the iterator every time it needs to do something interesting with doc values.

Likely some of the places I had to fix to use an iterator could be improved, e.g. if they could know their DV iterator was not already on the current doc, they could blindly call advance.

asfimport commented 8 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Wouldn't this mean we'd need 2X the search-time code [...]

If there were a utility to always get you a random access API? Perhaps not. It does seem like a majority of consumers would want the random access API only... things like grouping, sorting, and faceting are all driven off of document ids. For each ID, we check the docvalues. We don't actually do skipping/leapfrogging like a filter would do since we still need to do work for each document, even if the DV doesn't exist for that document.

I haven't thought about what this means for code further down the stack, but it does seem worth exploring in general.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I have been playing with the idea of having an advanceExact method (which I guess is the alternative to adding a 2nd search API for doc values). It removes stress on consumers, since the method can be called blindly: it never advances beyond the target document. It also removes some stress on the codec, since it no longer has to find the next document that has a value.
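As an illustration of why both sides get simpler, here is a minimal sketch of a dense implementation and a facet-counting consumer. `DenseOrds` is a hypothetical stand-in, not a Lucene class, and the boolean return value ("does the target document have a value?") is an assumption of this sketch, since the semantics were still open at this point in the thread.

```java
import java.util.BitSet;

/** Hypothetical sketch of an advanceExact-style API on a dense field. */
final class AdvanceExactSketch {

  /** Dense per-segment ordinals plus a bit set marking the docs that have a value. */
  static final class DenseOrds {
    private final int[] ords;
    private final BitSet docsWithValue;
    private int doc = -1;

    /** ordsByDoc[d] is the ordinal of doc d, or -1 if doc d has no value. */
    DenseOrds(int[] ordsByDoc) {
      this.ords = ordsByDoc;
      this.docsWithValue = new BitSet(ordsByDoc.length);
      for (int d = 0; d < ordsByDoc.length; d++) {
        if (ordsByDoc[d] >= 0) {
          docsWithValue.set(d);
        }
      }
    }

    /** Positions on target and reports whether it has a value; never looks past target. */
    boolean advanceExact(int target) {
      doc = target;
      return docsWithValue.get(target);
    }

    /** Only meaningful after advanceExact returned true. */
    int ordValue() {
      return ords[doc];
    }
  }

  /** Consumer: one blind call per hit, no comparison against the iterator position. */
  static int[] count(DenseOrds dv, int[] hits, int valueCount) {
    int[] counts = new int[valueCount + 1]; // slot 0 collects docs without a value
    for (int doc : hits) {
      int ord = dv.advanceExact(doc) ? dv.ordValue() : -1;
      counts[ord + 1]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    DenseOrds dv = new DenseOrds(new int[] {1, 0, -1, 1, -1, 2}); // made-up data
    System.out.println(java.util.Arrays.toString(count(dv, new int[] {0, 1, 2, 3, 5}, 3)));
  }
}
```

The codec side reduces to a bit lookup, with no search for the next document that has a value.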

I ran the wikimedium10m benchmark, to which I added the sorting tasks from the nightly benchmark to check the impact. There seems to be a consistent speedup for queries for which norms are the bottleneck (term queries and simple conjunctions/disjunctions) and for sorted queries (TermTitleSort and TermDTSort).

                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy2       55.31     (20.1%)       54.45     (18.5%)   -1.6% ( -33% -   46%)
            OrNotHighLow      875.16      (3.3%)      870.60      (2.9%)   -0.5% (  -6% -    5%)
         MedSloppyPhrase      210.38      (3.9%)      209.40      (3.8%)   -0.5% (  -7% -    7%)
         LowSloppyPhrase      126.86      (2.5%)      126.74      (2.1%)   -0.1% (  -4% -    4%)
              AndHighMed      151.22      (1.7%)      151.30      (2.3%)    0.0% (  -3% -    4%)
             LowSpanNear       20.08      (2.6%)       20.10      (2.9%)    0.1% (  -5% -    5%)
                 Respell       77.27      (3.8%)       77.36      (3.5%)    0.1% (  -6% -    7%)
               LowPhrase       42.32      (2.1%)       42.40      (1.9%)    0.2% (  -3% -    4%)
              HighPhrase       20.01      (4.1%)       20.06      (3.7%)    0.3% (  -7% -    8%)
                Wildcard       46.20      (3.5%)       46.32      (3.9%)    0.3% (  -6% -    7%)
        HighSloppyPhrase       15.99      (5.1%)       16.04      (4.9%)    0.3% (  -9% -   10%)
                 Prefix3       43.21      (2.9%)       43.39      (3.1%)    0.4% (  -5% -    6%)
               MedPhrase      151.07      (3.4%)      151.69      (3.7%)    0.4% (  -6% -    7%)
            OrNotHighMed      151.21      (2.3%)      151.98      (2.6%)    0.5% (  -4% -    5%)
             AndHighHigh       58.73      (1.4%)       59.05      (1.4%)    0.5% (  -2% -    3%)
             MedSpanNear       22.36      (1.6%)       22.48      (1.6%)    0.6% (  -2% -    3%)
                  IntNRQ       13.75     (12.5%)       13.83     (13.1%)    0.6% ( -22% -   29%)
            OrHighNotMed       62.26      (2.7%)       62.70      (3.2%)    0.7% (  -5% -    6%)
           OrNotHighHigh       58.38      (2.6%)       58.82      (2.4%)    0.7% (  -4% -    5%)
            HighSpanNear       39.78      (2.2%)       40.09      (3.0%)    0.8% (  -4% -    6%)
           OrHighNotHigh       44.88      (2.8%)       45.29      (2.7%)    0.9% (  -4% -    6%)
              AndHighLow      694.25      (4.8%)      703.66      (3.8%)    1.4% (  -6% -   10%)
               OrHighLow       91.20      (3.4%)       92.54      (3.7%)    1.5% (  -5% -    8%)
            OrHighNotLow      105.90      (3.0%)      107.79      (4.4%)    1.8% (  -5% -    9%)
                  Fuzzy1       79.92     (12.3%)       81.61     (12.1%)    2.1% ( -19% -   30%)
              OrHighHigh       29.18      (7.2%)       29.83      (7.3%)    2.2% ( -11% -   18%)
               OrHighMed       19.44      (7.2%)       19.89      (7.3%)    2.3% ( -11% -   18%)
           TermTitleSort       81.70      (5.6%)       83.67      (5.8%)    2.4% (  -8% -   14%)
                 LowTerm      682.24      (4.5%)      704.58      (4.1%)    3.3% (  -5% -   12%)
              TermDTSort      103.25      (5.7%)      106.77      (4.0%)    3.4% (  -5% -   13%)
                 MedTerm      249.00      (2.5%)      260.56      (3.2%)    4.6% (  -1% -   10%)
                HighTerm      103.70      (3.2%)      109.27      (3.6%)    5.4% (  -1% -   12%)

Note that the patch has barely any tests, so it's really just for playing. :) We'd also still need to define the semantics of this method.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I also see good speedups to the otherwise "lightweight" queries:

Report after iter 19:
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Prefix3       43.40      (5.2%)       42.48      (8.8%)   -2.1% ( -15% -   12%)
                  IntNRQ       10.05      (8.8%)        9.87     (10.5%)   -1.8% ( -19% -   19%)
            HighSpanNear       19.38      (5.2%)       19.14      (6.6%)   -1.2% ( -12% -   11%)
               LowPhrase       19.34      (1.9%)       19.21      (3.6%)   -0.7% (  -6% -    4%)
                PKLookup      350.45      (1.3%)      348.51      (2.8%)   -0.6% (  -4% -    3%)
             MedSpanNear       41.12      (4.5%)       40.98      (4.7%)   -0.4% (  -9% -    9%)
                  Fuzzy1      115.35      (2.3%)      115.06      (2.8%)   -0.2% (  -5% -    5%)
             LowSpanNear       85.93      (2.1%)       85.78      (2.3%)   -0.2% (  -4% -    4%)
               MedPhrase       77.08      (2.7%)       77.03      (2.9%)   -0.1% (  -5% -    5%)
                 Respell       62.22      (2.2%)       62.26      (1.4%)    0.1% (  -3% -    3%)
                Wildcard       37.39      (4.4%)       37.43      (5.8%)    0.1% (  -9% -   10%)
                  Fuzzy2      100.18      (2.0%)      100.31      (1.6%)    0.1% (  -3% -    3%)
         LowSloppyPhrase       14.75      (4.9%)       14.79      (4.2%)    0.2% (  -8% -    9%)
              HighPhrase        3.81      (5.2%)        3.82      (6.2%)    0.4% ( -10% -   12%)
              AndHighLow      912.50      (2.5%)      916.11      (3.8%)    0.4% (  -5% -    6%)
            OrNotHighLow      957.24      (2.5%)      963.91      (2.7%)    0.7% (  -4% -    6%)
         MedSloppyPhrase       48.46      (4.8%)       48.80      (4.3%)    0.7% (  -8% -   10%)
              AndHighMed       46.40      (1.7%)       46.87      (1.6%)    1.0% (  -2% -    4%)
             AndHighHigh       43.36      (1.9%)       43.80      (1.9%)    1.0% (  -2% -    4%)
                 LowTerm      449.83      (2.5%)      454.76      (5.1%)    1.1% (  -6% -    8%)
        HighSloppyPhrase       16.13      (6.8%)       16.34      (6.3%)    1.3% ( -11% -   15%)
            OrNotHighMed       98.19      (3.2%)       99.56      (3.1%)    1.4% (  -4% -    7%)
           OrNotHighHigh       21.69      (4.5%)       22.16      (4.8%)    2.2% (  -6% -   12%)
           OrHighNotHigh       18.16      (7.7%)       18.75      (8.0%)    3.2% ( -11% -   20%)
            OrHighNotMed       61.81      (9.4%)       64.27      (9.5%)    4.0% ( -13% -   25%)
                 MedTerm      123.87      (4.5%)      129.22      (3.3%)    4.3% (  -3% -   12%)
            OrHighNotLow       25.19     (11.2%)       26.28     (11.5%)    4.4% ( -16% -   30%)
              OrHighHigh       12.29      (7.4%)       12.96      (8.7%)    5.5% (  -9% -   23%)
               OrHighMed       12.36      (7.4%)       13.09      (8.5%)    5.9% (  -9% -   23%)
                HighTerm       38.51      (5.7%)       40.80      (4.4%)    5.9% (  -3% -   17%)
               OrHighLow       19.42      (8.6%)       20.66      (9.7%)    6.4% ( -10% -   26%)

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry, that was wikimediumall that I ran, 20 JVM iters, multiple iters per JVM, multiple concurrent queries, etc.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a patch that tries to implement this advanceExact method on all codecs. Initially I wanted to require that the target be strictly greater than the current doc id, but this caused issues with comparators that may need to get the value multiple times, and with scorers whose Scorer.score() is called multiple times (which causes the norm to be decoded twice). So the current patch only requires that the target be greater than or equal to the current document. I managed to get the whole test suite passing twice in a row, and luceneutil still gives results that are similar to the above.
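The "greater than or equal" contract can be sketched as follows for a sparse field. `SparseLongs` is a hypothetical class written for this illustration and is not taken from the patch; throwing `IllegalArgumentException` on a backwards target is also just a choice made here to make the contract visible.

```java
/** Hypothetical sketch of the "target >= current doc" contract for advanceExact. */
final class AdvanceExactSemanticsSketch {

  static final class SparseLongs {
    private final int[] docs;    // sorted doc ids that have a value
    private final long[] values; // value for each entry of docs
    private int idx = 0;         // first entry whose doc is >= the current doc
    private int doc = -1;

    SparseLongs(int[] docs, long[] values) {
      this.docs = docs;
      this.values = values;
    }

    /** target may equal the current doc, so repeated calls are allowed. */
    boolean advanceExact(int target) {
      if (target < doc) {
        throw new IllegalArgumentException("target " + target + " is before current doc " + doc);
      }
      while (idx < docs.length && docs[idx] < target) {
        idx++;
      }
      doc = target;
      return idx < docs.length && docs[idx] == target;
    }

    /** Only meaningful after advanceExact returned true. */
    long longValue() {
      return values[idx];
    }
  }

  public static void main(String[] args) {
    SparseLongs dv = new SparseLongs(new int[] {2, 7}, new long[] {10L, 20L});
    // A comparator may ask for the same doc twice; both calls give the same answer.
    System.out.println(dv.advanceExact(2) + " " + dv.advanceExact(2) + " " + dv.longValue());
    System.out.println(dv.advanceExact(5)); // doc 5 has no value
  }
}
```

Because the internal pointer never moves past the target, calling the method again with the same target is harmless, which is what comparators and repeated scoring need.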

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to the semantics and the patch. Thanks @jpountz!

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 9aca4c9d56089a9ac89df5fd93be76a4fe822448 in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9aca4c9

LUCENE-7462: Give doc values APIs an advanceExact method.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I merged the proposed change. I'll keep an eye on the nightly benchmarks to verify there is a speedup as expected.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 97339e2cacc308c3689d1cd16dfbc44ebea60788 in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=97339e2

LUCENE-7462: Fix LegacySortedSetDocValuesWrapper to reset upTo when calling advanceExact.

asfimport commented 7 years ago

ASF subversion and git services (migrated from JIRA)

Commit 71c65184562499eba365d166fe3fabe0dbdc747b in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=71c6518

LUCENE-7462: Fix buggy advanceExact impl of empty binary doc values.