hossman opened 9 years ago
Here's a test case demonstrating the issue (requires Guava to use murmur hash)...
package net.agkn.hll;

/*
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import static org.testng.Assert.assertEquals;
import static org.testng.Assert.assertTrue;
import static org.testng.Assert.assertFalse;

import com.google.common.hash.Hashing;
import com.google.common.hash.HashFunction;

import org.testng.annotations.Test;

/**
 * Tests {@link HLL} with different log2m params to confirm accuracy improvements
 */
public class TestIncreasedAccuracy {
    final static int NUM_VALUES = 11762;
    final static long BIG_PRIME = 982451653L;
    final static long MAX_LONG = 3628504890999L;
    final static HashFunction HASHER = Hashing.murmur3_128();

    @Test
    public void test_log2m14_Vs_log2m16() {
        final HLL hll_14 = new HLL(14, 6);
        final HLL hll_16 = new HLL(16, 6);
        // sanity check discrepancy isn't because of explicit or sparse storage
        final HLL hll_16_nos = new HLL(16, 6, 0, false/* no sparse */, HLLType.FULL);

        long longValue = MAX_LONG;
        for (int i = 1; i <= NUM_VALUES; i++) {
            final long hashedValue = HASHER.hashLong(longValue).asLong();
            hll_14.addRaw(hashedValue);
            hll_16.addRaw(hashedValue);
            hll_16_nos.addRaw(hashedValue);
            longValue -= BIG_PRIME;
        }

        // sanity check how far things have promoted
        assertEquals(HLLType.FULL, hll_14.getType(), "hll_14 type");
        assertEquals(HLLType.SPARSE, hll_16.getType(), "hll_16 type");
        assertEquals(HLLType.FULL, hll_16_nos.getType(), "hll_16_nos type");

        final long hll_14_estimate = hll_14.cardinality();
        final long hll_16_estimate = hll_16.cardinality();
        final long hll_16_nos_estimate = hll_16_nos.cardinality();

        // sanity check that sparse and full log2m=16 representations have same estimate
        assertEquals(hll_16_estimate, hll_16_nos_estimate,
                     "sparse HLL had diff estimate from full HLL (log2m=16)");

        // FAILS: log2m=16 estimate (11739) is less accurate than log2m=14 estimate (11766)
        assertTrue(Math.abs(NUM_VALUES - hll_16_estimate) <= Math.abs(NUM_VALUES - hll_14_estimate),
                   "log2m=16 estimate ("+hll_16_estimate+") is less accurate than "+
                   "log2m=14 estimate ("+hll_14_estimate+")");
    }
}
My understanding of the HLL algorithm (which may be flawed, in which case please correct me and close this issue) is that for any fixed set of input values, the accuracy of any estimate from an HLL built from those values should increase as the "m" value (the number of registers) used in the HLL increases.

Ie: since the expected relative standard error of a HyperLogLog estimate is roughly 1.04/sqrt(m), an HLL built with log2m=16 should, on average, produce a more accurate estimate than one built with log2m=14.

In my testing however, I'm frequently encountering situations where "smaller" HLL instances are producing more accurate cardinality estimates, which I can't explain.

I've created a reproducible test case that demonstrates the problem, which I will post as a separate comment.
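To quantify the expectation above: the well-known HyperLogLog error bound says the relative standard error is approximately 1.04/sqrt(m), so quadrupling m (going from log2m=14 to log2m=16) should roughly halve the typical error. A minimal standalone sketch (plain Java, no library dependencies; class and method names are my own for illustration) computing the expected error at each setting:

```java
public class ExpectedError {
    public static void main(String[] args) {
        for (int log2m : new int[] {14, 16}) {
            final int m = 1 << log2m; // number of registers
            // HyperLogLog's expected relative standard error: ~1.04 / sqrt(m)
            final double relError = 1.04 / Math.sqrt(m);
            System.out.printf("log2m=%d  m=%d  expected relative error ~%.4f%n",
                              log2m, m, relError);
        }
    }
}
// prints:
// log2m=14  m=16384  expected relative error ~0.0081
// log2m=16  m=65536  expected relative error ~0.0041
```

For NUM_VALUES=11762 that works out to a typical deviation of roughly ±96 for log2m=14 versus roughly ±48 for log2m=16, which is why the observed result (the log2m=16 estimate being *further* from the truth) is surprising; of course this is a distributional statement, so a single input set can deviate.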