Broadword bit selection [LUCENE-5098]

asfimport commented 11 years ago

Migrated from LUCENE-5098 by Paul Elschot, 2 votes, resolved Jul 15 2013 Attachments: LUCENE-5098.patch (versions: 2)

asfimport commented 11 years ago

This has some methods and constants inspired by the article "Broadword Implementation of Rank/Select Queries" by Sebastiano Vigna, January 30, 2012, currently retrievable from http://vigna.di.unimi.it/ftp/papers/Broadword.pdf .

I'd expect this to become really useful in the Elias-Fano sequence of #6148 after an index is added to that.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The patch looks good to me. I'm looking forward to seeing the impact these methods will have on the EliasFanoDocIdSet performance!

Some minor comments about the patch:

Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits?
I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case.
Since the main goal of rank9 is code documentation, maybe it should be made package-private?

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

TestBroadWord extends TestCase

Could we extend LuceneTestCase for tests, please? I know it doesn't have any impact here but if it's good to have a single superclass for all tests in case we need to tweak/ alter something.

As for longHex – I always liked the "assembly" version better:

    final static char [] HEX = "0123456789abcdef".toCharArray();
    private static String longHex(long x)
    {
        char [] asHex = new char [16];
        for (int i = 16; --i >= 0; x >>>= 4) {
            asHex[i] = HEX[(int) x & 0x0F];
        }
        return "0x" + new String(asHex);
    }

but if you strive for simplicity this should do too:

    private static String longHex(long x)
    {
        return String.format(Locale.ENGLISH, "%016x", x);
    }

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Could we extend LuceneTestCase for tests, please?

I think forbidden checker should catch this :-)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

It does.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits?

I kept them separate to have a clear reference to the Vigna broadword article.

The class javadoc already refers to all methods and constants from there, so that could be easily merged into BitUtil's class javadoc when merging the two classes. Also both classes have only static attributes.

So merging is no problem at all, but I'm still in two minds about this. Any opinions either way, i.e. merging class BroadWord here into the existing BitUtil or not?

I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case.

In the Quasi Succinct Indices article (EliasFanoDocIdSet, LUCENE-5084), in Section 9 under Reading Unary Codes, Vigna mentions that within the right longword:

Our experiments show that broadword bit search is extremely effective, unless the number of reads is very small, as in that case computing iteratively the least significant bit becomes competitive. Indeed, when skipping a very small number of position (e.g., less then eight) we simply resort to iterating through the list.

In the test code there is a comment that select9 is about 7 times faster than selectNaive, 45ns against 295 ns for a very dense case on my old 32 bit home machine. Since the machine is 32 bits, that is no more than a rough indication that the selectNaive here (that iterates) might indeed preferable over the select9 here when selecting before the 8th set bit. A safe conclusion is that moving selectNaive to the test cases now would be premature.

Since the main goal of rank9 is code documentation, maybe it should be made package-private?

I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private.

As for longHex – I always liked the "assembly" version better.

I have used that mostly during debugging, and it ended up also in toString in EliasFanoEncoder iirc. How about putting the assembly version in BitUtil?

Could we extend LuceneTestCase for tests, please?

I don't really like it, but my next patch here will have that. Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't really like it, but my next patch here will have that.

Is there any reason why you don't like it? Voices of criticism are always welcome, especially those constructive ones ;)

The only change for you should be that it'll require JVM assertions to be enabled and that the method order will be randomized (which you can fix by putting a Seed annotation on the class).

There are a few test classes that in fact do not extend LuceneTestCase but they do it for a reason (they "test the tester" and LuceneTestCase is not really reentrant).

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides, at least not in the way that I use them:

 ant -Dtestcase=TestBroadWord test

. Also I prefer to try and avoid having that discussion here.

Is the forbidden checker a precommit check?

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

You can call ant check-forbidden-apis from the lucene or solr root directory. And yes, it's also a precommit check.

LuceneTestCase does more checks on your test case, so we require every test in Lucene extends this class.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

A safe conclusion is that moving selectNaive to the test cases now would be premature.

OK.

I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private.

I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic). We used to have a utility method in BitUtil to compute pop counts on longs but we removed it in #3297 in favor of Long.bitCount.

How about putting the assembly version in BitUtil?

Or ToStringUtils?

Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?

We try to keep this page as concise as possible so I added a mention to it in https://wiki.apache.org/lucene-java/DeveloperTips.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't really like using LuceneTestCase here because the tests here pass when extending TestCase and they do not use anything that LuceneTestCase provides

If you prefer not to have that discussion here I will shut up. I just wanted to clarify that LTC does checks on your test that you bypass when you extend from TestCase. Randomization of method order (catches order dependencies that shouldn't be there), ensuring your tests run with assertions enabled, checking if your tests don't spawn extra threads – all these (and more) may not matter to you because you're a seasoned engineer but they do matter in general when contributions come from various sources (or are refactored later by people other than the original author). Randomization isn't the only goal of LTC.

In short: it's not a remark to you personally, it's a remark to everyone in general that extending LTC should be a priority because it will catch faulty tests faster.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressing since bitCount is supposed to be an intrinsic)

I also toyed with this a bit. Interesting because rank9 is essentially Hacker's delight implementation, but with a multiplication at the end rather than shifts/ additions (which I think originated from the fact that multiplication used to be much slower than additions/shifts back in the day).

I ran a few caliper bechmarks on a million longs, different distributions (Intel I7, JDK17) just to see what the output will be.

   benchmark distribution     us linear runtime
    HdPopCnd        ZEROS   2333 =
    HdPopCnd         FULL   2334 =
    HdPopCnd       RANDOM   2329 =
    HdPopCnd       ONEBIT   2334 =
       Rank9        ZEROS   1651 =
       Rank9         FULL   1652 =
       Rank9       RANDOM   1651 =
       Rank9       ONEBIT   1651 =
LongBitCount        ZEROS    411 =
LongBitCount         FULL    394 =
LongBitCount       RANDOM    391 =
LongBitCount       ONEBIT    404 =
 NaivePopCnt        ZEROS    585 =
 NaivePopCnt         FULL  39019 ======
 NaivePopCnt       RANDOM 171365 ==============================
 NaivePopCnt       ONEBIT  28155 ====

The naive loop was:

        int cnt = 0;
        while (x != 0) {
            if (((x >>>= 1) & 1) != 0L) {
                cnt++;
            }
        }
        return cnt;

and you can see that even for all zeros (when in fact there is no counting at all) it's still slower than the intrinsified popcnt. Note full-ones is not the worst case (I believe due to constant branch misprediction in a random input).

A zoomed-in benchmark without the naive impl.:

   benchmark distribution   us linear runtime
    HdPopCnd        ZEROS 2331 =============================
    HdPopCnd         FULL 2329 =============================
    HdPopCnd       RANDOM 2333 =============================
    HdPopCnd       ONEBIT 2333 =============================
       Rank9        ZEROS 1650 =====================
       Rank9         FULL 1650 =====================
       Rank9       RANDOM 1652 =====================
       Rank9       ONEBIT 1650 =====================
LongBitCount        ZEROS  400 =====
LongBitCount         FULL  402 =====
LongBitCount       RANDOM  401 =====
LongBitCount       ONEBIT  391 =====

You can see when popcnt isn't intrinsified by running with IBM's J9, for example:

   benchmark distribution   ms linear runtime
LongBitCount        ZEROS 2.25 =============================
LongBitCount         FULL 2.22 =============================
LongBitCount       RANDOM 2.25 =============================
LongBitCount       ONEBIT 2.25 =============================
    HdPopCnd        ZEROS 2.25 =============================
    HdPopCnd         FULL 2.25 =============================
    HdPopCnd       RANDOM 2.22 =============================
    HdPopCnd       ONEBIT 2.22 =============================
       Rank9        ZEROS 1.62 =====================
       Rank9         FULL 1.62 =====================
       Rank9       RANDOM 1.62 =====================
       Rank9       ONEBIT 1.62 =====================

But I think they'll eventually catch up with modern cpus too so I'd stick with Long.bitCount.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

In the naive implementation above the first shift is before the first masked test. Does this miss the lowest bit? Not that it matters much...

I'll provide a new patch with:

rank9 package private, users of IBM's J9 will know what to do,
longHex in ToStringUtils, and
extends LuceneTestCase.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Indeed, it's a regression from an earlier implementation where I used Long.rotateRight. It won't play a difference but well spotted.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Looks good to me (there's a likely unused import but it can be fixed later). Very cool, btw.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1502690 from @jpountz https://svn.apache.org/r1502690

LUCENE-5098: Broadword utility methods.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1502691 from @jpountz https://svn.apache.org/r1502691

LUCENE-5098: Broadword utility methods (merged from r1502690).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Committed. Thanks Paul and Dawid!

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

On reviewing the article I found that the method names I chose for algorithms 1 and 2 in the VIgna article were wrong. I used rank9 and select9 because of the titles of sections 3 and 5 in the article, but the method names do less than what is described in these sections. The method names should have been bitCount and select. I'll add a correction for this at #6300, there is no need to reopen this issue.

apache / lucene

Broadword bit selection [LUCENE-5098] #6162