apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Try encoding very frequent terms using a dense bitmap #13147

Open msokolov opened 6 months ago

msokolov commented 6 months ago

Description

In case we have very dense terms we may be able to a better job encoding postings if we use a dense bitmap encoding where every doc is represented with a 1 (if it has the term) or 0 if it does not. For example there can be "boolean" fields used to indicate whether a doc is a parent or child doc (for use with ToParentBlockJoin).

In general if a term has freq P (docFreq/maxFreq) a dense encoding will require 1/P bits per doc (that is per posting, ie doc+term). In contrast our current PackedBits encoding requirement varies with the distribution of the terms among docs (since it encodes deltas). If the terms are uniformly distributed (every P docs) then the maximum gap is minimized and the bitrate will be ln2(P). But if the terms are not completely uniform among the docs it will require more. Modeling with a uniform random distribution we find that dense encoding can be more efficient for high P. I observed empirically that for P=1/2 PackedBits uses 3 bits per doc, P=1/4: 4 bits/doc and for P=1/8, 6 bits per doc. So there is a break-even at P=1/4 where the dense encoding and packed bits delta encoding have the same bitrate.

A dense encoding can have other benefits since it is very simple to read, supporting random access within a block. I think we should try making a block-by-block decision whether to use this encoding and see if we can find some benefit in reducing query time and possibly better compression as well.

msokolov commented 6 months ago

One question I have is how to indicate the dense encoding of a block. I see our blocks start with a single byte that indicates number of packed bits per doc, 0 means the block is totally dense, the lower 5 bits indicate the bits-per-doc, and the upper 3 are an exception count. Are we using the exception count? Oh I see our utilities still support it but it was removed from Lucene99PostingsFormat. Perhaps at least for experimentation purposes I can safely use one of those bits

msokolov commented 6 months ago

I have some initial implementation working in BlockDocsEnum, but one thing I'm unsure about is whether to provide it in all of the PostingsEnum/ImpactsEnum specializations. I feel like this is only likely to be used and helpful for docs-only fields (maybe?), but is it also possible that such fields can be iterated over by these other enums? I see that postings() returns a BlockDocsEnum when there are no positions, yet in tests I am seeing CheckIndex gets a handle on an EverythingEnum (and other enums) over a test field indexed with no positions and no freqs.

msokolov commented 6 months ago

I ran luceneutil over wikimediumall. The index size was slightly reduced:

65200   ../indices/baseline/facets
18923720        ../indices/baseline/index
18988924        ../indices/baseline
65204   ../indices/candidate/facets
18774956        ../indices/candidate/index
18840164        ../indices/candidate

in a microbenchmark where I indexed random doc-only postings I saw ~28% index size reduction.

query performance does seem to have registered some actual change:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value                                                         [178/1805]
                    OrHighNotLow      124.19      (6.0%)      111.98      (6.7%)   -9.8% ( -21% -    3%) 0.000         
                     LowSpanNear        1.50      (1.1%)        1.42      (1.2%)   -4.8% (  -7% -   -2%) 0.000
               HighTermTitleSort       86.63      (3.0%)       82.70      (2.2%)   -4.5% (  -9% -    0%) 0.000
             MedIntervalsOrdered        3.25      (4.3%)        3.11      (4.3%)   -4.1% ( -12% -    4%) 0.003
                      OrHighHigh       23.47      (6.7%)       22.61      (3.3%)   -3.7% ( -12% -    6%) 0.029
             LowIntervalsOrdered        4.20      (4.1%)        4.05      (4.1%)   -3.5% ( -11% -    4%) 0.007                                                                                
                     AndHighHigh       25.46      (8.5%)       24.57      (4.9%)   -3.5% ( -15% -   10%) 0.114  
     BrowseRandomLabelTaxoFacets        2.05     (14.8%)        1.98     (11.0%)   -3.4% ( -25% -   26%) 0.405                                                                                
            HighIntervalsOrdered        2.09      (5.3%)        2.02      (5.4%)   -3.1% ( -13% -    7%) 0.063                                                                                
                    HighSpanNear        4.25      (1.9%)        4.13      (2.0%)   -2.8% (  -6% -    1%) 0.000
                       OrHighMed       43.34      (3.1%)       42.18      (2.1%)   -2.7% (  -7% -    2%) 0.001
            BrowseDateTaxoFacets        2.78      (7.6%)        2.70      (6.6%)   -2.7% ( -15% -   12%) 0.234                                                                                
       BrowseDayOfYearTaxoFacets        2.81      (7.2%)        2.74      (6.2%)   -2.5% ( -14% -   11%) 0.236                                                                                
                         Prefix3      126.88      (2.3%)      123.78      (3.5%)   -2.4% (  -8% -    3%) 0.009                                                                                
                     MedSpanNear       11.93      (0.9%)       11.65      (1.1%)   -2.3% (  -4% -    0%) 0.000                                                                                
                    OrHighNotMed      141.45      (5.1%)      138.33      (7.0%)   -2.2% ( -13% -   10%) 0.254                                                                                
                      AndHighMed       36.62      (5.6%)       35.82      (3.1%)   -2.2% ( -10% -    6%) 0.124                                                                                
                       MedPhrase       67.69      (2.9%)       66.22      (2.6%)   -2.2% (  -7% -    3%) 0.013                                                                                
                HighSloppyPhrase       10.38      (1.6%)       10.20      (1.5%)   -1.8% (  -4% -    1%) 0.000                                                                                
                          IntNRQ        8.57     (14.4%)        8.42     (16.1%)   -1.8% ( -28% -   33%) 0.713
                        HighTerm      271.19      (4.0%)      266.87      (5.1%)   -1.6% ( -10% -    7%) 0.271
                 MedSloppyPhrase        8.12      (1.9%)        8.00      (2.5%)   -1.6% (  -5% -    2%) 0.028
                      HighPhrase       39.43      (3.8%)       38.94      (3.1%)   -1.2% (  -7% -    5%) 0.257
                         MedTerm      235.50      (3.4%)      232.58      (4.7%)   -1.2% (  -9% -    7%) 0.339
                       LowPhrase       46.81      (2.8%)       46.27      (2.3%)   -1.2% (  -6% -    4%) 0.157
                   OrHighNotHigh      147.42      (4.7%)      145.78      (6.2%)   -1.1% ( -11% -   10%) 0.525
                      TermDTSort       88.33      (2.8%)       87.38      (1.8%)   -1.1% (  -5% -    3%) 0.151
           HighTermDayOfYearSort      152.37      (2.1%)      150.79      (1.8%)   -1.0% (  -4% -    2%) 0.093
                         LowTerm      254.01      (1.9%)      251.72      (2.6%)   -0.9% (  -5% -    3%) 0.207
                 LowSloppyPhrase       24.52      (0.9%)       24.32      (1.4%)   -0.8% (  -3% -    1%) 0.029
                   OrNotHighHigh      199.37      (3.8%)      197.74      (4.9%)   -0.8% (  -9% -    8%) 0.557
               HighTermMonthSort     1581.75      (2.6%)     1569.14      (2.1%)   -0.8% (  -5% -    4%) 0.292
                    OrNotHighMed      134.43      (2.7%)      133.51      (3.3%)   -0.7% (  -6% -    5%) 0.471
                       OrHighLow      279.41      (2.1%)      277.84      (2.2%)   -0.6% (  -4% -    3%) 0.412
                          Fuzzy1       64.73      (1.5%)       64.48      (0.7%)   -0.4% (  -2% -    1%) 0.302
          OrHighMedDayTaxoFacets        3.84      (6.3%)        3.83      (5.4%)   -0.4% ( -11% -   12%) 0.845
         AndHighMedDayTaxoFacets       31.84      (1.2%)       31.74      (1.5%)   -0.3% (  -2% -    2%) 0.444
                          Fuzzy2       36.90      (1.3%)       36.80      (0.8%)   -0.3% (  -2% -    1%) 0.383
     BrowseRandomLabelSSDVFacets        1.57      (5.5%)        1.57      (3.8%)   -0.2% (  -9% -    9%) 0.906
                        PKLookup      140.43      (1.7%)      140.30      (2.1%)   -0.1% (  -3% -    3%) 0.876
                      AndHighLow      279.44      (2.2%)      279.34      (2.3%)   -0.0% (  -4% -    4%) 0.958
                    OrNotHighLow      345.34      (1.7%)      345.21      (1.9%)   -0.0% (  -3% -    3%) 0.948
                         Respell       33.36      (1.5%)       33.38      (1.3%)    0.1% (  -2% -    2%) 0.881
            MedTermDayTaxoFacets       10.12      (2.4%)       10.13      (2.4%)    0.1% (  -4% -    4%) 0.912
       BrowseDayOfYearSSDVFacets        2.32      (5.4%)        2.33      (3.3%)    0.1% (  -8% -    9%) 0.953
            HighTermTitleBDVSort        4.74      (3.3%)        4.74      (4.0%)    0.1% (  -6% -    7%) 0.902                                                                                
                        Wildcard      136.61      (2.5%)      136.82      (2.2%)    0.2% (  -4% -    4%) 0.831  
            BrowseDateSSDVFacets        0.68     (13.1%)        0.68     (13.0%)    0.4% ( -22% -   30%) 0.928                                                                                
           BrowseMonthTaxoFacets        2.84      (3.8%)        2.87      (1.3%)    1.1% (  -3% -    6%) 0.207                                                                                
           BrowseMonthSSDVFacets        2.38      (5.1%)        2.41      (4.0%)    1.3% (  -7% -   11%) 0.362
        AndHighHighDayTaxoFacets        3.23      (3.6%)        3.34      (2.9%)    3.5% (  -2% -   10%) 0.001

so this looks positive. I can try tuning the decision parameter controlling which encoding to use to see what impact that may have. I guess what I wonder is whether the added complexity is worth chasing this, but I'm pretty encouraged that the overhead of the conditionals isn't overwhelming the "within-block skipping" this affords.

msokolov commented 6 months ago

also -- the slowdown for AndHighHighDayTaxoFacets counters the overall trend. I wonder what's going on there.

msokolov commented 6 months ago

The results are kind of noisy -- on re-running AndHighHighDayTaxoFacets didn't show much change but there was a ~8.4% regression for AndHighMedDayTaxoFacets, and the cast of characters on the Plus side also look somewhat different

msokolov commented 6 months ago

I tried increasing the usage of dense encoding by enabling it when it would consume up to 3/2 as many bits as packed bits encoding, rather than using it only when it would use up to the same amount. The wikipedia index produced is larger than the first candidate but still smaller than the baseline. Query performance does seem somewhat improved. I'm going to stop posting walls of numbers here until I have a chance to try some further improvements:

  1. writing only the bytes needed rather than full-width longs
  2. possibly we can save some time in advance by combining bit-counting with finding the next set bit
  3. Maybe there is a way to save the overhead of the conditionals that determine which block encoding to decode. EG by introducing a block-reader although then we might get another function call overhead in place of the conditional
  4. I want to test with other Amazon indexes as well
  5. It would be nice to have a theory about why the faceting test cases seem to see worse perf. I guess they are different in that they do not use top-N collection, so no score-based skipping?
mikemccand commented 6 months ago

I am seeing CheckIndex gets a handle on an EverythingEnum (and other enums) over a test field indexed with no positions and no freqs.

Hmm, does CheckIndex pull all the various enums (docs only, docs+freqs, docs+freqs+positions, etc.) when testing? That might be a checking gap if not.

mikemccand commented 6 months ago

also -- the slowdown for AndHighHighDayTaxoFacets counters the overall trend. I wonder what's going on there.

Wait -- this task got faster right? And some others got slower, e.g. OrHighNotLow?

msokolov commented 6 months ago

oh dear you are correct - I was reading this table backwards! OK, so perhaps this "optimization" is not really helping very much at query time. Still it might be possible to squeeze some more juice from it.

mikemccand commented 6 months ago

I really love this idea! And it's wild that it's net/net reducing enwiki index size even at higher than expected cutover to dense encoding criteria!