Closed itismewxg closed 9 years ago
to be more accurate, to ask the question is this way how could a python script intepret an HLL data struct byte array?
Hi, the C code compiles to a python module so you need to compile the module and install it. Once you do this you can then use it like any other python module. If you are using pip, you can simply do:
sudo pip install HLL
Alternatively, you can checkout the code and do:
sudo python setup install
If the module fails to compile it is probably b/c you are missing a dependency. Commonly this is the python development headers. If you are on Ubuntu/Mint you can do:
sudo apt-get install python-dev
I hope this answers your question. Let me know if I'm not understanding it correctly.
I get you, but I think with marshall, python just cannot interpret the HLL struct build by c++
Okay I understand your question now so I am re-opening this issue.
As a work around you can use registers() to get the registers and serialize that. Then when you need to get the object back, de-serialize the registers, make a HyperLogLog and use set_register() to set the values of the individual registers. It's not pretty but it should work.
edit: concision
Hi Joshua
First, thank you for providing this implementation of a very useful datastructure! However, I also have a question problems with serialization: can't get it to work, neither with pickle, cPickle or the registers method you kindly proposed.
I have listed my test code below:
# sudo pip install HLL
from HLL import HyperLogLog
import cPickle as pickle1
import pickle as pickle2
hll = HyperLogLog(12,123)
for i in range(1000000): hll.add(str(i))
# result is ~1 million
print hll.cardinality()
# this won't work, result is 0.0
hll2 = pickle1.loads(pickle1.dumps(hll))
print hll2.cardinality()
# this won't work, result is 0.0
hll2 = pickle2.loads(pickle2.dumps(hll))
print hll2.cardinality()
# this won't work
# "ValueError: Rank is greater than the maximum possible rank."
hll2 = HyperLogLog(12,hll.seed())
for i,r in enumerate(hll.registers()): hll.set_register(i,r)
Since the Rank check fails in the above code example, there must be an error.
Either there is an error in the add
function, where the rank is set (which is around 1); or in the set_register function, where the rank is checked to be below 16 (which is at 2). In any event, the code above sets some of the registers to values above 16.
I could read up on the HLL to help debugging :-)
I also have a question problems with serialization: can't get it to work, neither with pickle, cPickle or the registers method you kindly proposed
I'm adding support for pickling (it doesn't work currently) so the work around is temporary.
Since the Rank check fails in the above code example, there must be an error.
Good catch. There was a bug in the set_register function causing this (it is unrelated to pickling). I just pushed a fix.
I merged in the fix. Let me know if you still have any problems pickling. It should work with both pickle and cPickle. Thanks for the testing Pimin.
Actually, there is still a problem with pickle. The unpickled hll doesn't return the same cardinality, when the number of sampled elements is small. Test code:
# sudo pip install HLL
from HLL import HyperLogLog
import cPickle as pickle
n = 1
while n <= 1000000:
hll1 = HyperLogLog(10, 123)
for j in range(n): hll1.add(str(j))
hll2 = pickle.loads(pickle.dumps(hll1))
cardinality1 = hll1.cardinality()
cardinality2 = hll2.cardinality()
print ("n=%d" % n), ("and pickle worked" if cardinality1 == cardinality2 else "and pickle failed")
n *= 10
"""
n=1 and pickle failed
n=10 and pickle failed
n=100 and pickle failed
n=1000 and pickle failed
n=10000 and pickle worked
n=100000 and pickle worked
n=1000000 and pickle worked
"""
I'll take a look.
Thank you, your effort is appreciated!
The seed value was not being pickled. This is fixed on v.98. However this isn't why the cardinalities are wrong. If you run pkl.py on the v.98 branch you can see the problem:
__reduce__()
(<type 'HLL.HyperLogLog'>, (5, 123), '\x06\x01')
(<type 'HLL.HyperLogLog'>, (5, 123), '\x06\x01')
True
registers:
[6, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 6, 3, 0, 0, 0, 0, 0, 0, 0]
[6, 1, 0, 117, 108, 101, 115, 0, 0, 111, 103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
False
cardinalities:
17.2981882036
13.2811999769
False
When Py_BuildValue("s", ...) encounters a register set to zero it interprets it as the terminating null character so it doesn't get all the registers e.g. '\x00' should follow '\x01'. Larger n tend to work b/c none of the registers are zero so all the registers get copied. Py_BuildValue("s#", ...) doesn't work since pickle protocol 2 does not allow null bytes in strings.
Fix is on v.98.
v.98 merged into master.
Cool, thanks! My tests pass now.
the HLL struct is construct in c++ and record in the data stream, made a python script read this the data and decode out the cardinality of the hll data, is there any interface to do this?