aggregateknowledge / java-hll

Java library for the HyperLogLog algorithm
http://research.neustar.biz/2013/12/24/open-source-release-java-hll/
Apache License 2.0

Higher log2m value not always producing more accurate estimates #15

Open hossman opened 9 years ago

hossman commented 9 years ago

My understanding from the HLL algorithm (which may be flawed, in which case please correct me and close this issue) is that for any fixed set of input values, the accuracy of any estimate from an HLL built from those values should increase as the "m" value used in the HLL increases.

I.e.:

if you build two HLL instances with different log2m settings, and add the exact same set of (raw) values to both, then the HLL with the larger log2m should give you more accurate results than the HLL with the smaller log2m setting.

In my testing however, I'm frequently encountering situations where "smaller" HLL instances are producing more accurate cardinality estimates -- which I can't explain.

I've created a reproducible test case that demonstrates the problem, which I will post as a separate comment.
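For context, the standard HyperLogLog relative-error bound is roughly 1.04 / sqrt(m), where m = 2^log2m; that bound describes the expected (standard) error, so any individual input set can still land closer to the true cardinality under a smaller m. A minimal sketch (not part of the library, class name is hypothetical) comparing the bounds for the two settings used below:

```java
// Sketch: expected relative error of a HyperLogLog estimate as a
// function of log2m, per the standard 1.04/sqrt(m) bound.
public class HllErrorBound {
    static double relativeError(final int log2m) {
        // m = 2^log2m registers
        return 1.04 / Math.sqrt(1L << log2m);
    }

    public static void main(final String[] args) {
        System.out.printf("log2m=14: +/- %.4f%n", relativeError(14)); // +/- 0.0081
        System.out.printf("log2m=16: +/- %.4f%n", relativeError(16)); // +/- 0.0041
    }
}
```

So log2m=16 should roughly halve the expected error relative to log2m=14, but only in expectation, not necessarily for every fixed input set.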

hossman commented 9 years ago

Here's a test case demonstrating the issue (requires Guava to use murmur hash)...

package net.agkn.hll;

/*
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import static org.testng.Assert.assertEquals;
import static org.testng.Assert.assertTrue;

import com.google.common.hash.Hashing;
import com.google.common.hash.HashFunction;

import org.testng.annotations.Test;

/**
 * Tests {@link HLL} with different log2m params to confirm accuracy improvements
 */
public class TestIncreasedAccuracy {

    final static int NUM_VALUES = 11762;
    final static long BIG_PRIME = 982451653L;
    final static long MAX_LONG = 3628504890999L; 
    final static HashFunction HASHER = Hashing.murmur3_128();

    @Test
    public void test_log2m14_Vs_log2m16() {

        final HLL hll_14 = new HLL(14, 6);
        final HLL hll_16 = new HLL(16, 6);
        // sanity check the discrepancy isn't because of explicit or sparse storage
        final HLL hll_16_nos = new HLL(16, 6, 0, false/* no sparse */, HLLType.FULL);

        long longValue = MAX_LONG;

        for (int i = 1; i <= NUM_VALUES; i++) {
            final long hashedValue = HASHER.hashLong(longValue).asLong();

            hll_14.addRaw(hashedValue);
            hll_16.addRaw(hashedValue);
            hll_16_nos.addRaw(hashedValue);

            longValue -= BIG_PRIME;
        }

        // sanity check how far things have promoted
        assertEquals(hll_14.getType(), HLLType.FULL, "hll_14 type");
        assertEquals(hll_16.getType(), HLLType.SPARSE, "hll_16 type");
        assertEquals(hll_16_nos.getType(), HLLType.FULL, "hll_16_nos type");

        final long hll_14_estimate = hll_14.cardinality();
        final long hll_16_estimate = hll_16.cardinality();
        final long hll_16_nos_estimate = hll_16_nos.cardinality();

        // sanity check that sparse and full log2m=16 representations have same estimate
        assertEquals(hll_16_estimate, hll_16_nos_estimate,
                     "sparse HLL had diff estimate from full HLL (log2m=16)");

        // FAILS: log2m=16 estimate (11739) is less accurate than log2m=14 estimate (11766)
        assertTrue(Math.abs(NUM_VALUES - hll_16_estimate) <= Math.abs(NUM_VALUES - hll_14_estimate),
                   "log2m=16 estimate ("+hll_16_estimate+") is less accurate than "+
                   "log2m=14 estimate ("+hll_14_estimate+")");
    }
}