aggregateknowledge / java-hll

Java library for the HyperLogLog algorithm
http://research.neustar.biz/2013/12/24/open-source-release-java-hll/
Apache License 2.0
312 stars 71 forks source link

Error rate is more than theoretical error rate #28

Open shreshthagarg21 opened 3 years ago

shreshthagarg21 commented 3 years ago

The research paper claims the error rate to be 1.04/sqrt(m)

For log2m = 11 and w = 5 the theoretical error rate should be ~2.3% but in our tests we have seen a variance of upto 3.5% for inserting 100K unique alphanumeric UUIDs. Below is our pseudo code :

HLL hll = new HLL( 11, 5 );
HashFunction hashFunction = Hashing.murmur3_128();

for (String str : UUIDs) {
  Hasher hasher = hashFunction.newHasher();
  hasher.putString(str, StandardCharsets.UTF_8);
  hll.addRaw(hasher.hash().asLong());
}

How can we reduce this additional 1.2% variance and fall in the theoretically claimed error rate ?