aggregateknowledge / java-hll

Java library for the HyperLogLog algorithm
http://research.neustar.biz/2013/12/24/open-source-release-java-hll/
Apache License 2.0
312 stars 71 forks source link

Unusual behavior for HLL with register width 6 #10

Closed Akninirith closed 10 years ago

Akninirith commented 10 years ago

There seems to be some kind of a bug with HLL execution; when one instantiates an HLL as having a regwidth of 6, the cardinality returned is consistently 0 at large sample sizes. The following test was constructed and run in FullHLLTest. Could this be something to do with cutoff measurements? Thanks!

/**
 * Test case for 6-bit cutoff problems
 */
@Test
public void sixBitSixteenKRegisterCutoffBugTest() {
    Long temp = System.currentTimeMillis();

    // this does NOT work
    final HLL brokenHll4 = new HLL(13, 6);
    {
        Random rand = new Random(temp);
        for(int i=0; i<16384*16384; i++) {
            long random = rand.nextLong();
            brokenHll4.addRaw(random);
        }

        final long cardinality = brokenHll4.cardinality();

        assertTrue(cardinality > 0); // this does NOT work
    }

    // this DOES work
    final HLL brokenHll5 = new HLL(13, 5);
    {
        Random rand = new Random(temp);
        for(int i=0; i<16384*16384; i++) {
            long random = rand.nextLong();
            brokenHll5.addRaw(random);
        }

        final long cardinality = brokenHll5.cardinality();

        assertTrue(cardinality > 0); // this DOES work
    }
}
blinsay commented 10 years ago

Thanks for finding this!

It looks like HLLUtil#largeEstimator is returning NaN, which then turns into 0 when it gets cast to a long. The argument going into largeEstimator seems sane at first glance, but I'll have to take a deeper look later this week.

blinsay commented 10 years ago

It doesn't look like this was a regression introduced in any of the recent changes to the estimator code. I checked out de5cc2dcef88603386ca9b77e1f041ad213cc874 and it also has the same issue.

ghost commented 10 years ago

We hit a long overflow because computing 2L for L > 63 isn't possible with Java's 63 bit signed integer. To see this, look here and note that for regWidth = 6, log2m = 13 we have:

(((1 << regWidth) - 1 - 1) + log2m) = 64 - 2 + 13 = 75

and 275 is greater than Long.MAX_VALUE.

I believe we can safely move TWO_TO_L to be a double[] since all calculations that use it end up converting it to a double anyway.

ghost commented 10 years ago

I'll push a new jar to Sonnatype in a bit.

ghost commented 10 years ago

Sorry for the delay, but it should be in the central repo within a few hours.