Use a bit set to count long-tail of singleton FacetLabels? [LUCENE-10080]

asfimport commented 3 years ago

I was talking about this with @rmuir about #11008, and he had a neat idea for more efficient facet counting.

Today we accumulate counts directly in an HPPC native int/int map, or a non-sparse int[] (if enough hits match the query).

But it is likely that many of these facet counts are singletons (occur only once in each query). To be more space efficient, we could wrap a bit set around the map or int[]. The first time we see an ordinal, we set its bit. The second and subsequent times, we increment the count as we do today.

If we use a non-sparse bitset (e.g. FixedBitSet) that will add some non-sparse heap cost O(maxDoc) for each segment, but if there are enough ordinals to count, that can be a win over just the HPPC native int map for some cases?

Maybe this could be an intermediate implementation, since we already cover the "very low hit count" (use HPPC int/int map) and "very high hit count" (using int[]) today?

Also, this bit set would be able to quickly iterate over the sorted ordinals, which might be helpful if we move the three big int[] into numeric doc values?

Migrated from LUCENE-10080 by Michael McCandless (@mikemccand), updated Oct 04 2021

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Interesting idea! +1 to experiment with this.

A couple questions (to make sure I'm understanding the idea):

Wouldn't the heap cost be O(maxOrdinal) (not maxDoc)?
Wouldn't it only make sense to wrap the int/int map (and never wrap the int[])? If we're allocating a dense int[] up-front, might as well count directly into it right instead of allocating more heap for the bitset?

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I kinda imagined something like this:

instead of:

// either add or increment a count in the hppc int/int map
hppcmap.insertOrIncrement(ordinal);

we do this:

if (longtail.getAndSet(ordinal)) {
  // bit was already set, so bump the hppc map counter (or add a new entry)
  hppcmap.insertOrIncrement(ordinal);
} else {
  // otherwise bit was not set before, we've recorded our count of 1.
}

The idea is that today, the long tail of ordinals with just count=1 are wasting a lot of space (32-bits at least plus whatever overhead that specialized map has). Instead we could use 1 bit-per-ordinal for these "exceptional cases" with something like FixedBitSet (maybe even SparseFixedBitSet later).

It is true the heap cost would be O(maxOrdinal). But I think in case of a large number of ordinals with a zipfian distribution, it could save space overall, esp. If the query is a worst-case one (such as MatchAllDocsQuery).

This approach doesn't make sense to wrap the int[], only the specialized map, for the int[] it would just add more memory space and cpu cycles. But if the number of unique ordinals is large it makes sense to try to not be so wasteful and allocate tons of int[]'s that are mostly zeros.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

also the heap cost is already maxOrdinal for all this faceting crap. its just about constants here. i'm sure inserting an entry into a native int/int hashmap requires at least 64 bits (32 bits for the key, 32 bits for the value). it is only 1 bit with a bitset.

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Thanks @rmuir, this all makes sense. I think we've got a common understanding :)

asfimport commented 3 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Hi, I'm working on this issue.

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Awesome @mdmarshmallow! I'm really interested in what you discover :)

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Here is the pull request link for the change: https://github.com/apache/lucene/pull/288

I ran this change on wikimediumall and got the following results:

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
       HighTermMonthSort       31.46     (18.1%)       30.46     (17.8%)   -3.2% ( -33% -   39%) 0.573
                  IntNRQ       26.95     (37.9%)       26.40     (35.2%)   -2.1% ( -54% -  114%) 0.859
              TermDTSort       44.88     (17.3%)       44.04     (16.8%)   -1.9% ( -30% -   39%) 0.728
                  Fuzzy1       51.03      (8.8%)       50.21      (9.3%)   -1.6% ( -18% -   18%) 0.578
   HighTermDayOfYearSort       62.41     (11.2%)       61.41     (10.7%)   -1.6% ( -21% -   22%) 0.646
              AndHighMed       61.78      (7.2%)       60.91      (6.9%)   -1.4% ( -14% -   13%) 0.524
BrowseDayOfYearSSDVFacets        4.43      (7.5%)        4.39      (6.8%)   -1.0% ( -14% -   14%) 0.647
           OrNotHighHigh      454.43      (7.4%)      449.88      (6.7%)   -1.0% ( -14% -   14%) 0.654
   BrowseMonthSSDVFacets        4.87      (7.3%)        4.82      (6.6%)   -1.0% ( -13% -   13%) 0.653
              AndHighLow      484.80      (7.8%)      480.21      (8.4%)   -0.9% ( -15% -   16%) 0.712
    HighIntervalsOrdered        1.14      (7.8%)        1.13      (7.8%)   -0.9% ( -15% -   16%) 0.730
        HighSloppyPhrase       11.92      (7.8%)       11.82      (7.6%)   -0.8% ( -15% -   15%) 0.737
   BrowseMonthTaxoFacets        0.82      (8.6%)        0.81      (8.0%)   -0.8% ( -16% -   17%) 0.764
               OrHighMed       53.48      (7.0%)       53.09      (6.6%)   -0.7% ( -13% -   13%) 0.734
              OrHighHigh        8.81      (6.4%)        8.75      (6.1%)   -0.7% ( -12% -   12%) 0.716
    BrowseDateTaxoFacets        0.78      (8.4%)        0.77      (8.2%)   -0.7% ( -15% -   17%) 0.786
                Wildcard       62.05      (8.6%)       61.66     (10.0%)   -0.6% ( -17% -   19%) 0.830
BrowseDayOfYearTaxoFacets        0.78      (8.4%)        0.77      (8.2%)   -0.6% ( -15% -   17%) 0.821
               MedPhrase       19.38      (8.0%)       19.27      (7.8%)   -0.6% ( -15% -   16%) 0.819
                 Respell       28.41      (7.7%)       28.25      (7.2%)   -0.5% ( -14% -   15%) 0.820
             AndHighHigh       55.43      (7.1%)       55.15      (6.2%)   -0.5% ( -12% -   13%) 0.812
         MedSloppyPhrase       88.90      (8.9%)       88.47      (8.5%)   -0.5% ( -16% -   18%) 0.860
               LowPhrase       18.44      (6.8%)       18.37      (7.4%)   -0.4% ( -13% -   14%) 0.849
     MedIntervalsOrdered        6.51      (7.8%)        6.49      (7.2%)   -0.4% ( -14% -   15%) 0.880
             LowSpanNear       10.95      (7.0%)       10.91      (6.7%)   -0.3% ( -13% -   14%) 0.878
            OrHighNotMed      544.36      (8.1%)      542.80      (7.6%)   -0.3% ( -14% -   16%) 0.909
             MedSpanNear        9.89      (7.8%)        9.87      (6.9%)   -0.2% ( -13% -   15%) 0.925
         LowSloppyPhrase       14.41      (7.4%)       14.38      (6.8%)   -0.2% ( -13% -   15%) 0.931
     LowIntervalsOrdered       36.25      (7.8%)       36.22      (7.1%)   -0.1% ( -13% -   16%) 0.972
              HighPhrase      115.92      (8.5%)      115.85      (8.1%)   -0.1% ( -15% -   18%) 0.982
                PKLookup      160.74      (6.7%)      160.80      (6.1%)    0.0% ( -11% -   13%) 0.985
                 Prefix3       72.60      (8.9%)       72.64      (9.2%)    0.1% ( -16% -   20%) 0.984
            HighSpanNear        2.85      (7.9%)        2.86      (6.6%)    0.3% ( -13% -   16%) 0.912
            OrHighNotLow      570.42      (8.5%)      572.40      (8.4%)    0.3% ( -15% -   18%) 0.896
            OrNotHighMed      581.16      (7.9%)      584.23      (9.2%)    0.5% ( -15% -   19%) 0.845
            OrNotHighLow      564.66      (8.5%)      568.35      (9.5%)    0.7% ( -15% -   20%) 0.819
                HighTerm      713.52      (7.0%)      718.50      (7.0%)    0.7% ( -12% -   15%) 0.751
    HighTermTitleBDVSort       66.21     (17.1%)       66.79     (18.1%)    0.9% ( -29% -   43%) 0.875
                 MedTerm     1140.30      (6.1%)     1150.77      (7.4%)    0.9% ( -11% -   15%) 0.670
           OrHighNotHigh      433.10      (7.9%)      440.81      (8.3%)    1.8% ( -13% -   19%) 0.488
                  Fuzzy2       46.81     (14.9%)       47.66     (13.7%)    1.8% ( -23% -   35%) 0.688
               OrHighLow      331.28      (8.7%)      338.74      (9.4%)    2.3% ( -14% -   22%) 0.432
                 LowTerm     1459.96      (8.9%)     1493.52      (7.5%)    2.3% ( -12% -   20%) 0.377

This change shouldn't affect the QPS, but it should ideally improve heap usage. I could do method profiling to get an idea of heap usage changes, but is there other ways I can test to see if there is a heap usage reduction? Thanks!

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

I accidentally deleted my forked branch and therefore closed the linked pull request I put earlier. Here is a new one: https://github.com/apache/lucene/pull/304

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

@mdmarshmallow thanks for diving into this! I'll see if I can dig in today or tomorrow and provide some feedback. Thanks again!

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Hey @mdmarshmallow thanks for the PR! It looks like you isolated your solution to only taxonomy faceting. I think idea this is actually applicable to a number of different Facets implementations. Did you want to tackle all of them, or were you just starting with taxonomy as a first step? Thanks again!

asfimport commented 2 years ago

Marc D'Mello (@mdmarshmallow) (migrated from JIRA)

Hi @gsmiller, I looked into it some more and I think it would be a good idea to start with IntTaxonomyFacets and StringValueFacetCounts as those would be the easiest to implement. I also think it might be useful to improve faceting benchmarks to better test my change. I know you opened an issue about that but I opened an issue that was more specific to this change here.

asfimport commented 2 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

Sound good to me. Thanks @mdmarshmallow!

apache / lucene

Use a bit set to count long-tail of singleton FacetLabels? [LUCENE-10080] #11118