Open nvanva opened 1 year ago
The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.
%%timeit
import os
import sketch
import mmh3
m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)
step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000
for i in range(step1, maxval1+1, step1):
# o = str(i)
o,_ = mmh3.hash64(str(i), seed=0, signed=False,)
hll.add(o)
for i in range(step2, maxval2+1, step2):
# o = str(i)
o, _ = mmh3.hash64(str(i), seed=0, signed=False,)
hll2.add(o)
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')
The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?
Hi!
That's right, the default addh uses Python's hash function which is seeded at each invocation. I should modify this to be consistent.
In the short term, you can convert to a numpy array and hash it (from_np and from_shs should provide this). from_shs expects already hashed data (no hashing), and from_np hashes. I just used python's hash so it could work on any python object.
I'll let you know when it's updates. Thanks for the issue!
Best,
Daniel
On Sunday, September 24, 2023, Nikolay Arefyev @.***> wrote:
The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.
%%timeit import os import sketch import mmh3
m = 2**18 hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)
step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000 for i in range(step1, maxval1+1, step1):
o = str(i)
o,_ = mmh3.hash64(str(i), seed=0, signed=False,) hll.add(o)
for i in range(step2, maxval2+1, step2):
o = str(i)
o, _ = mmh3.hash64(str(i), seed=0, signed=False,) hll2.add(o)
hll.write(f'tmp1{os.getpid()}') hll2.write(f'tmp2{os.getpid()}')
The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?
— Reply to this email directly, view it on GitHub https://github.com/dnbaker/sketch/issues/74#issuecomment-1732662604, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ5UVJVSEOYAH2N7UX2TY3X4CJ6HANCNFSM6AAAAAA5CJSOLM . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Hi! I'm using CSetSketch from python. I noticed that when I create, fill and then save this structure on disk for 2 sets in the same process, then loading it from disk and calculating jaccard estimation works well. But when 2 sets are processed in different processes, then the estimate is 0. For cardinality estimation everything works well in both cases. Here is a minimal example showing this:
Run this code twice in 2 different process. Then run:
It will print this in my case: tmp1_736949 tmp2_736949 0.16761398315429688 tmp1_736949 tmp2_736999 0.0 tmp1_736999 tmp2_736949 0.0 tmp1_736999 tmp2_736999 0.166900634765625