Closed B3AU closed 9 years ago
Seems like the problem comes from having a small cardinality compared to k, resulting in many zero registers and a bad estimate. The svpcom pure python implementation adds a bias correction to solve this problem.
Am I missing something?
Yes. In your example linear counting is being used since the raw estimate is less than 5/2*m. Linear counting estimates the cardinality using
m*log(m/V)
where m=number of registers and V=number of zero registers. You can see this by modifying your example:
regs = hll.registers()
print sum(map(lambda x: 1 if x == 0 else 0, regs))
print "Estimated cardinality: {}".format(hll.cardinality())
In my example I get the output:
True cardinality: 996
15413
Estimated cardinality: 1444.08451191
Note that
2^14*log2(2^14/15413) = 1444.085
The svpcom implementation handles small ranges (mostly) the same:
#count number or registers equal to 0
V = self.M.count(0)
if V > 0:
H = self.m * math.log(self.m / float(V))
return H if H <= get_treshold(self.p) else self._Ep()
else:
return self._Ep()
As you can see, when V > 0 a linear count is performed. If the linear count estimate is less than the threshold value (which depends on the error rate used in the constructor), then the linear count is returned. When the linear count is returned there is no additional bias correction. For small ranges I suspect HLL uses the linear count far more than the svpcom implementation though I haven't tested this.
The problem of zero value registers causing a spike in the relative error near 5/2*m is something I've planned to address for awhile now. I'll try and get it fixed soon :) I'm closing this issue since the example code results in a linear count and isn't an example of zero value registers inflating the HyperLogLog estimate where the HyperLogLog estimate is used. For clarity I've created a separate issue to track this.
I'm getting getting estimates which are way off. Here's some simple code to show the problem:
which outputs:
Am I missing something? This seems to be basic functionality. Otherwise, the library works great. Super fast compared to https://github.com/svpcom/hyperloglog (which does give correct estimates)