ascv / HyperLogLog

Fast HyperLogLog for Python.
MIT License
99 stars 19 forks source link

Incorrect cardinality after merge #25

Closed Guangyi-Z closed 6 years ago

Guangyi-Z commented 6 years ago
hll1 = HyperLogLog(16)
hll2 = HyperLogLog(16)
for i in range(100000):
    hll1.add(str(random.random()))
for i in range(100000):
    hll2.add(str(random.random()))
for i in range(100000):
    r = str(random.random())
    hll1.add(r)
    hll2.add(r)

print(hll1.cardinality())
print(hll2.cardinality())

hll1.merge(hll2)
print(hll1.cardinality())

gives

201675.90068912748
201259.4273446178
201675.90068912748

the last one should be ~300k.

ascv commented 6 years ago

It looks like hll1 is using the pre-merge cached cardinality. I put a fix on the v1.2.3 branch. I will get it pushed out later.

Nice catch.

Guangyi-Z commented 6 years ago

Do I have any workarounds before the fix?

ascv commented 6 years ago

Fix is live and should be available through pip. Let me know if you have any issues. Happy holidays!

ascv commented 6 years ago

I am still getting the issue on a machine I installed the package through pip. I am looking into it.

ascv commented 6 years ago

This turned out to be an issue local to my machine.