dnbaker / sketch

C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings
MIT License
154 stars 13 forks source link

SetSketches saved from different processes have jaccard estimation of 0 #74

Open nvanva opened 1 year ago

nvanva commented 1 year ago

Hi! I'm using CSetSketch from python. I noticed that when I create, fill and then save this structure on disk for 2 sets in the same process, then loading it from disk and calculating jaccard estimation works well. But when 2 sets are processed in different processes, then the estimate is 0. For cardinality estimation everything works well in both cases. Here is a minimal example showing this:

import os
import sketch

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))

for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))

hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

Run this code twice in 2 different process. Then run:

from pathlib import Path
for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)

It will print this in my case: tmp1_736949 tmp2_736949 0.16761398315429688 tmp1_736949 tmp2_736999 0.0 tmp1_736999 tmp2_736949 0.0 tmp1_736999 tmp2_736999 0.166900634765625

nvanva commented 1 year ago

The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.

%%timeit
import os
import sketch
import mmh3

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000
for i in range(step1, maxval1+1, step1):
#     o = str(i)
    o,_ = mmh3.hash64(str(i), seed=0, signed=False,)
    hll.add(o)

for i in range(step2, maxval2+1, step2):
#     o = str(i)
    o, _ = mmh3.hash64(str(i), seed=0, signed=False,)
    hll2.add(o)

hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?

dnbaker commented 1 year ago

Hi!

That's right, the default addh uses Python's hash function which is seeded at each invocation. I should modify this to be consistent.

In the short term, you can convert to a numpy array and hash it (from_np and from_shs should provide this). from_shs expects already hashed data (no hashing), and from_np hashes. I just used python's hash so it could work on any python object.

I'll let you know when it's updates. Thanks for the issue!

Best,

Daniel

On Sunday, September 24, 2023, Nikolay Arefyev @.***> wrote:

The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.

%%timeit import os import sketch import mmh3

m = 2**18 hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000 for i in range(step1, maxval1+1, step1):

o = str(i)

o,_ = mmh3.hash64(str(i), seed=0, signed=False,)
hll.add(o)

for i in range(step2, maxval2+1, step2):

o = str(i)

o, _ = mmh3.hash64(str(i), seed=0, signed=False,)
hll2.add(o)

hll.write(f'tmp1{os.getpid()}') hll2.write(f'tmp2{os.getpid()}')

The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?

— Reply to this email directly, view it on GitHub https://github.com/dnbaker/sketch/issues/74#issuecomment-1732662604, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQ5UVJVSEOYAH2N7UX2TY3X4CJ6HANCNFSM6AAAAAA5CJSOLM . You are receiving this because you are subscribed to this thread.Message ID: @.***>