ascv / HyperLogLog

Fast HyperLogLog for Python.
MIT License

Serialization/deserialization of HyperLogLog objects leads to big error gap #46

Closed. Squalene closed this issue 3 months ago.

Squalene commented 1 year ago

Hi,

First of all, thank you very much for this implementation. While playing with the library, I found that serializing and deserializing a HyperLogLog object and then merging it into another leads to a big drop in accuracy. Here is the code to reproduce:

Python: 3.9.16 HLL: 2.0.3

from HLL import HyperLogLog
import pickle
import random

random.seed(0)  # seeded once, so the two calls below draw different samples


def test_union_precision(serde=False):
    """Return estimated/true cardinality after merging many small sketches."""
    union_count = 1000
    candidate_values = [str(i) for i in range(100_000)]
    picked_values = set()  # exact set of values seen, used as ground truth
    agg_hll = HyperLogLog(p=8, seed=0)
    for _ in range(union_count):
        hll = HyperLogLog(p=8, seed=0)
        values = random.sample(candidate_values, k=random.randint(0, 100))
        picked_values.update(values)
        for v in values:
            hll.add(v)
        if serde:
            # Round-trip the sketch through pickle before merging.
            hll = pickle.loads(pickle.dumps(hll))
        agg_hll.merge(hll)

    # A ratio close to 1.0 means the merged estimate is accurate.
    return agg_hll.cardinality() / len(picked_values)


print(test_union_precision(serde=False), test_union_precision(serde=True))

gives

1.048  0.130

I have looked at the previously resolved issues, and my registers are indeed identical before and after serialization/deserialization, so I suspect the error is somewhere else. I am not familiar enough with the codebase to find it.

Thank you in advance for your help.

ascv commented 1 year ago

Thanks for reporting this. I suspect this is related to serialization/deserialization of the registers when in sparse representation. I will investigate. As a temporary workaround, you can pass sparse=False to the HyperLogLog constructor, e.g. hll = HyperLogLog(p=8, seed=0, sparse=False).
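
To check whether the workaround helps in a given setup, the reproduction above can be parameterized over how the sketch is constructed. The following is only a minimal sketch: the helper name `merged_estimate` and its parameters are hypothetical, and it assumes nothing about the library beyond the `add()`, `merge()` and `cardinality()` methods already used in this thread.

```python
import pickle


def merged_estimate(make_sketch, batches, serde=False):
    """Merge one sketch per batch into an aggregate and return its estimate.

    make_sketch: zero-argument callable returning an empty sketch that
                 supports add(), merge() and cardinality() (assumed interface).
    batches:     iterable of iterables of values to add.
    serde:       if True, pickle round-trip each sketch before merging it.
    """
    agg = make_sketch()
    for values in batches:
        sketch = make_sketch()
        for v in values:
            sketch.add(v)
        if serde:
            sketch = pickle.loads(pickle.dumps(sketch))
        agg.merge(sketch)
    return agg.cardinality()
```

With the library from this thread, the two configurations could then be compared on the same batches, e.g. `merged_estimate(lambda: HyperLogLog(p=8, seed=0), batches, serde=True)` against `merged_estimate(lambda: HyperLogLog(p=8, seed=0, sparse=False), batches, serde=True)`. Passing the same `batches` to both calls keeps the ground truth identical, so any gap between the two results comes from the construction arguments rather than from different random samples.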

Squalene commented 1 year ago

This indeed solves the issue, thank you.

ascv commented 3 months ago

This should be fixed in #47.