ascv / HyperLogLog

Fast HyperLogLog for Python.
MIT License
99 stars 19 forks source link

use python hll to get the HLL struct made by c++ #12

Closed itismewxg closed 9 years ago

itismewxg commented 9 years ago

the HLL struct is construct in c++ and record in the data stream, made a python script read this the data and decode out the cardinality of the hll data, is there any interface to do this?

itismewxg commented 9 years ago

to be more accurate, to ask the question is this way how could a python script intepret an HLL data struct byte array?

ascv commented 9 years ago

Hi, the C code compiles to a python module so you need to compile the module and install it. Once you do this you can then use it like any other python module. If you are using pip, you can simply do:

sudo pip install HLL

Alternatively, you can checkout the code and do:

sudo python setup install

If the module fails to compile it is probably b/c you are missing a dependency. Commonly this is the python development headers. If you are on Ubuntu/Mint you can do:

sudo apt-get install python-dev
ascv commented 9 years ago

I hope this answers your question. Let me know if I'm not understanding it correctly.

itismewxg commented 9 years ago

I get you, but I think with marshall, python just cannot interpret the HLL struct build by c++

ascv commented 9 years ago

Okay I understand your question now so I am re-opening this issue.

As a work around you can use registers() to get the registers and serialize that. Then when you need to get the object back, de-serialize the registers, make a HyperLogLog and use set_register() to set the values of the individual registers. It's not pretty but it should work.

edit: concision

skipperkongen commented 9 years ago

Hi Joshua

First, thank you for providing this implementation of a very useful datastructure! However, I also have a question problems with serialization: can't get it to work, neither with pickle, cPickle or the registers method you kindly proposed.

I have listed my test code below:

# sudo pip install HLL
from HLL import HyperLogLog
import cPickle as pickle1
import pickle as pickle2

hll = HyperLogLog(12,123)
for i in range(1000000): hll.add(str(i))
# result is ~1 million
print hll.cardinality()

# this won't work, result is 0.0
hll2 = pickle1.loads(pickle1.dumps(hll))
print hll2.cardinality()

# this won't work, result is 0.0
hll2 = pickle2.loads(pickle2.dumps(hll))
print hll2.cardinality()

# this won't work 
# "ValueError: Rank is greater than the maximum possible rank."
hll2 = HyperLogLog(12,hll.seed())
for i,r in enumerate(hll.registers()): hll.set_register(i,r)
skipperkongen commented 9 years ago

Since the Rank check fails in the above code example, there must be an error.

Either there is an error in the add function, where the rank is set (which is around 1); or in the set_register function, where the rank is checked to be below 16 (which is at 2). In any event, the code above sets some of the registers to values above 16.

I could read up on the HLL to help debugging :-)

ascv commented 9 years ago

I also have a question problems with serialization: can't get it to work, neither with pickle, cPickle or the registers method you kindly proposed

I'm adding support for pickling (it doesn't work currently) so the work around is temporary.

Since the Rank check fails in the above code example, there must be an error.

Good catch. There was a bug in the set_register function causing this (it is unrelated to pickling). I just pushed a fix.

ascv commented 9 years ago

I merged in the fix. Let me know if you still have any problems pickling. It should work with both pickle and cPickle. Thanks for the testing Pimin.

skipperkongen commented 9 years ago

Actually, there is still a problem with pickle. The unpickled hll doesn't return the same cardinality, when the number of sampled elements is small. Test code:

# sudo pip install HLL
from HLL import HyperLogLog
import cPickle as pickle

n = 1

while n <= 1000000:

    hll1 = HyperLogLog(10, 123)
    for j in range(n): hll1.add(str(j))

    hll2 = pickle.loads(pickle.dumps(hll1))

    cardinality1 = hll1.cardinality()
    cardinality2 = hll2.cardinality()
    print ("n=%d" % n), ("and pickle worked" if cardinality1 == cardinality2 else "and pickle failed")

    n *= 10

"""
n=1 and pickle failed
n=10 and pickle failed
n=100 and pickle failed
n=1000 and pickle failed
n=10000 and pickle worked
n=100000 and pickle worked
n=1000000 and pickle worked
"""
ascv commented 9 years ago

I'll take a look.

skipperkongen commented 9 years ago

Thank you, your effort is appreciated!

ascv commented 9 years ago

The seed value was not being pickled. This is fixed on v.98. However this isn't why the cardinalities are wrong. If you run pkl.py on the v.98 branch you can see the problem:

__reduce__()
(<type 'HLL.HyperLogLog'>, (5, 123), '\x06\x01')
(<type 'HLL.HyperLogLog'>, (5, 123), '\x06\x01')
True

registers:
[6, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 6, 3, 0, 0, 0, 0, 0, 0, 0]
[6, 1, 0, 117, 108, 101, 115, 0, 0, 111, 103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
False

cardinalities:
17.2981882036
13.2811999769
False

When Py_BuildValue("s", ...) encounters a register set to zero it interprets it as the terminating null character so it doesn't get all the registers e.g. '\x00' should follow '\x01'. Larger n tend to work b/c none of the registers are zero so all the registers get copied. Py_BuildValue("s#", ...) doesn't work since pickle protocol 2 does not allow null bytes in strings.

ascv commented 9 years ago

Fix is on v.98.

ascv commented 9 years ago

v.98 merged into master.

skipperkongen commented 9 years ago

Cool, thanks! My tests pass now.