Closed ascv closed 8 years ago
Looking forward to this!
As a temporary workaround I'm using svpcom's Python implementation to do the final estimation:
```python
import HLL
from random import choice

# Sample 5000 numbers from [0..1000), i.e. ~20% uniques
possible = range(1000)
sample = [choice(possible) for _ in range(5000)]  # sample with repetition
print("True cardinality: {}".format(len(set(sample))))

hll = HLL.HyperLogLog(14)
for s in sample:
    hll.add(str(s))
print("Estimated cardinality (C): {}".format(hll.cardinality()))

import struct
import hyperloglog  # svpcom's implementation

def cardinality(hll_c):
    # hyperloglog takes an error_rate instead of p; p = 14 corresponds to
    # error_rate 0.01 via p = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2))).
    # This should be inverted to get error_rate from p, and p from hll_c.size().
    hll_python = hyperloglog.HyperLogLog(0.01)
    hll_python.M = struct.unpack_from("B" * hll_c.size(), hll_c.registers())
    return hll_python.card()

print("Estimated cardinality (Python): {}".format(cardinality(hll)))
```
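For reference, the p-to-error_rate relationship mentioned in the comment can be inverted directly. This is a small sketch of that arithmetic; the function names are mine, not part of either library:

```python
import math

# An HLL with m = 2**p registers has standard error ~1.04 / sqrt(m),
# so p and error_rate can be converted back and forth.

def p_to_error_rate(p):
    return 1.04 / math.sqrt(2 ** p)

def error_rate_to_p(error_rate):
    return int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))

print(p_to_error_rate(14))    # 0.008125, i.e. ~0.8% standard error
print(error_rate_to_p(0.01))  # 14
```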
```
True cardinality: 991
Estimated cardinality (C): 1436.4178401346198
Estimated cardinality (Python): 995.648975995318
```
This gives me fast merging and adding from your implementation, combined with the more accurate HLL++ bias correction.
Any chance of solving this issue?
We use your library in our project with the cardinality fix by B3AU (using svpcom's library for cardinality estimation). But this solution is too slow for our needs (because svpcom's class instantiation and struct unpacking are slow).
The problem is that with small cardinalities (less than about 35k; around 50k in real life), the bias is too big, around 44-45%. I'm trying to rewrite your estimate function to match Google's HLL++ paper and svpcom's library, but it's all too complicated for me, so I have nothing so far.
We look forward to this fix and are also looking for another solution in the meantime, although there are very few HLL libraries (in any language) on the Internet.
Thank you so much for your work!!
> The problem is when you have small cardinalities (like less than about 35k, in real life 50k), the bias is too big, like around 44-45%.
Hmmm. When you are having this problem, can you run:
```python
print(list(map(int, my_hll.registers())))
```
and paste the results in this thread (or attach a file of the output) along with the expected (true) cardinality?
Sure!
The true cardinality of the set is 1649.
1. svpcom's HLL gives 1650, so the error rate is 0.0606%.
2. Your HLL gives 2392; the error rate is 45.0576%.
3. Your HLL with svpcom's cardinality estimation gives 1658; the error rate is just 0.5458%.

Please note that svpcom's HLL uses SHA-1 hashing instead of 32-bit MurmurHash, so the results in (1) and (3) are never exactly the same (the registers are filled differently), but both are still pretty accurate.
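The error rates quoted above are just relative errors against the true cardinality; a quick check using the estimates from the message above:

```python
true_card = 1649

estimates = [
    ("svpcom", 1650),
    ("C extension", 2392),
    ("C registers + svpcom estimator", 1658),
]
for name, estimate in estimates:
    # relative error as a percentage of the true cardinality
    error_pct = abs(estimate - true_card) / float(true_card) * 100
    print("{}: {:.4f}%".format(name, error_pct))

# svpcom: 0.0606%
# C extension: 45.0576%
# C registers + svpcom estimator: 0.5458%
```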
Here is the file attached: ascv_hll_registers1.txt
Thanks for posting the debugging info. It was very helpful.
When you have a large number of zero valued registers, the algorithm reverts back to doing a simple linear count. This is what was happening in your case. When I tried verifying the linear count I discovered a bug that was causing the linear count to be off. The fix is on the v1.1 branch. Let me know if you are still having problems on that branch.
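For context, the small range correction described above is ordinary linear counting over the zero-valued registers. A minimal sketch (not the library's actual code):

```python
import math

def linear_count(registers):
    # V = number of zero-valued registers; the estimate is m * ln(m / V)
    m = len(registers)
    zeros = registers.count(0)
    if zeros == 0:
        raise ValueError("no zero registers; use the raw HLL estimate")
    return m * math.log(m / float(zeros))

# Toy example with m = 16 registers, 4 of which are still zero:
regs = [0, 0, 0, 0, 1, 2, 1, 3, 1, 1, 2, 1, 4, 1, 2, 1]
print(linear_count(regs))  # 16 * ln(4) ~= 22.18
```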
On the v1.1 branch:
```python
from HLL import HyperLogLog
import json

with open('ascv_hll_registers1.txt') as f:
    abzac_regs = json.loads(f.read())

hll = HyperLogLog(14)
hll.set_registers(bytearray(abzac_regs))
print(hll.cardinality())  # prints 1658.14611035
```
Thank you so much again!
This fix really works! Could you update the library on PyPI?
I updated the library on PyPI, so it should be available through pip now. Thanks for testing and letting me know it is working okay for you.
I edited this issue to reflect the bug in the small range correction and I am closing it. I don't want users conflating the bug in the small range correction with bias correction. B3AU's workaround should no longer be necessary.
svpcom's implementation does some extra correction to get a smoother transition from linear counting to HLL.
Here's a plot for k=12
The small range correction is returning estimates that are abnormally off for a linear count.
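For reference, the original HyperLogLog paper switches between the two estimators with a hard threshold, which is what produces the discontinuity that svpcom's extra correction smooths over. A sketch under that assumption (`raw_estimate` stands for the uncorrected HLL estimate; this is not either library's actual code):

```python
import math

def corrected_estimate(raw_estimate, registers):
    m = len(registers)
    zeros = registers.count(0)
    # Small range correction from the original paper: below 2.5 * m,
    # fall back to linear counting if any register is still zero.
    if raw_estimate <= 2.5 * m and zeros > 0:
        return m * math.log(m / float(zeros))
    return raw_estimate
```

HLL++ instead subtracts an empirically measured bias near the threshold, which is what gives the smoother transition shown in the plot.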