wrong cardinality estimates

B3AU commented 9 years ago

I'm getting getting estimates which are way off. Here's some simple code to show the problem:

import HLL
from random import choice

#Sample 5000 numbers from [0..1000], i.e. ~20% uniques 
possible = range(1000)
sample = [choice(possible) for _ in range(5000)] #sample with repetition

print "True cardinality: {}".format(len(set(sample)))

hll = HLL.HyperLogLog(14)
for s in sample:
  hll.add(str(s))
print "Estimated cardinality: {}".format(hll.cardinality())

which outputs:

True cardinality: 987
Estimated cardinality: 1431.81903039

Am I missing something? This seems to be basic functionality. Otherwise, the library works great. Super fast compared to https://github.com/svpcom/hyperloglog (which does give correct estimates)

B3AU commented 9 years ago

Seems like the problem comes from having a small cardinality compared to k, resulting in many zero registers and a bad estimate. The svpcom pure python implementation adds a bias correction to solve this problem.

ascv commented 9 years ago

Am I missing something?

Yes. In your example linear counting is being used since the raw estimate is less than 5/2*m. Linear counting estimates the cardinality using

m*log(m/V)

where m=number of registers and V=number of zero registers. You can see this by modifying your example:

regs = hll.registers()
print sum(map(lambda x: 1 if x == 0 else 0, regs))
print "Estimated cardinality: {}".format(hll.cardinality())

In my example I get the output:

True cardinality: 996
15413
Estimated cardinality: 1444.08451191

Note that

2^14*log2(2^14/15413) = 1444.085

The svpcom implementation handles small ranges (mostly) the same:

 #count number or registers equal to 0
V = self.M.count(0)

if V > 0:
    H = self.m * math.log(self.m / float(V))
    return H if H <= get_treshold(self.p) else self._Ep()
else:
    return self._Ep()

As you can see, when V > 0 a linear count is performed. If the linear count estimate is less than the threshold value (which depends on the error rate used in the constructor), then the linear count is returned. When the linear count is returned there is no additional bias correction. For small ranges I suspect HLL uses the linear count far more than the svpcom implementation though I haven't tested this.

The problem of zero value registers causing a spike in the relative error near 5/2*m is something I've planned to address for awhile now. I'll try and get it fixed soon :) I'm closing this issue since the example code results in a linear count and isn't an example of zero value registers inflating the HyperLogLog estimate where the HyperLogLog estimate is used. For clarity I've created a separate issue to track this.

ascv / HyperLogLog

wrong cardinality estimates #16