ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.56k stars, 293 forks

weighted min hash - minhash_many function #195

Open dopc opened 1 year ago

dopc commented 1 year ago

Hey, thanks for this great project. I want to use MinHash on my text embedding vectors, which contain both negative and positive numbers. I searched the issues and found that weighted MinHash can be used for that. I tried it and it actually works well.

My problem is with the minhash_many function: its result differs from the result of the minhash function. Below is a minimal script to reproduce the issue, plus a screenshot of the output so you can see the difference without running the code.

I want to use minhash_many because it is faster than a for loop. Is this difference normal, or is it unexpected? Thanks.

from time import perf_counter as pc

import numpy as np
from datasketch import WeightedMinHashGenerator

# 20,000 random vectors of dimension 100 with values in [-1, 1)
vectors = np.random.uniform(-1, 1, (20000, 100))

# Generator for 100-dimensional input with a sample size of 32
mg = WeightedMinHashGenerator(vectors.shape[1], 32)

# Batch version: a single call over all rows
t0 = pc()
many_result = np.array([m.digest() for m in mg.minhash_many(vectors)])
print(f'shape many: {many_result.shape}')
print(f'time many: {pc()-t0:.3f}')
print(f'many_result[0][:10]:\n{many_result[0][:10]}\n')

# Loop version: one minhash call per row
t0 = pc()
for_result = np.array([mg.minhash(x).digest() for x in vectors])
print(f'shape for: {for_result.shape}')
print(f'time for: {pc()-t0:.3f}')
print(f'for_result[0][:10]:\n{for_result[0][:10]}')

[screenshot of the script's output]
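To put a number on how far the two outputs diverge, a small helper like the one below could be run on many_result and for_result. This is only a sketch: the names as_pairs and mismatch_report are hypothetical (not part of datasketch), the toy data stands in for the real arrays, and it assumes each digest is a sequence of (k, t) integer pairs, which matches the shape printed by the script above.

```python
from typing import Iterable, List, Sequence, Tuple

# One signature: sample_size rows, each an integer pair (k, t).
Digest = Sequence[Sequence[int]]


def as_pairs(digest: Digest) -> List[Tuple[int, int]]:
    """Normalize one digest to a list of plain (k, t) integer tuples."""
    return [(int(k), int(t)) for k, t in digest]


def mismatch_report(batch_a: Iterable[Digest],
                    batch_b: Iterable[Digest]) -> Tuple[int, int]:
    """Return (rows_compared, rows_that_differ) for two batches of digests."""
    compared = 0
    differing = 0
    for a, b in zip(batch_a, batch_b):
        compared += 1
        if as_pairs(a) != as_pairs(b):
            differing += 1
    return compared, differing


# Hypothetical toy values standing in for many_result and for_result.
many_like = [[[3, -1], [7, 0]], [[2, 4], [5, 1]]]
for_like = [[[3, -1], [7, 0]], [[2, 4], [9, 1]]]

print(mismatch_report(many_like, for_like))  # (2, 1)
```

With the script above, the call would be mismatch_report(many_result, for_result); NumPy arrays can be passed directly because each row is converted to plain tuples before comparison.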

ekzhu commented 1 year ago

@jroose-jv is this expected behavior? My understanding is that minhash_many is a batch version of minhash.

ekzhu commented 1 year ago

Sorry for the late response. If you need consistency across all of your weighted MinHash signatures, I recommend picking either minhash or minhash_many and using it exclusively, not mixing the two.
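To illustrate why mixing the two matters, here is a minimal sketch of how a weighted Jaccard estimate is typically read off two signatures: count the positions where both the k and t components agree, then divide by the sample size. The function name estimate_weighted_jaccard is hypothetical, and the description of the estimate is my understanding of the technique rather than a statement about datasketch internals. If the two methods produce different pairs for the same vector, signatures from different methods would agree in fewer positions, so comparisons should stay within one method.

```python
from typing import Sequence, Tuple

Pair = Tuple[int, int]  # one (k, t) entry of a weighted MinHash signature


def estimate_weighted_jaccard(sig_a: Sequence[Pair],
                              sig_b: Sequence[Pair]) -> float:
    """Fraction of signature positions where both (k, t) components match."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same sample size")
    if len(sig_a) == 0:
        raise ValueError("signatures must not be empty")
    matches = sum(
        1
        for (ka, ta), (kb, tb) in zip(sig_a, sig_b)
        if ka == kb and ta == tb
    )
    return matches / len(sig_a)


# Hypothetical signatures with a sample size of 4; two positions agree.
sig_x = [(3, -1), (7, 0), (2, 4), (5, 1)]
sig_y = [(3, -1), (7, 0), (2, 5), (9, 1)]

print(estimate_weighted_jaccard(sig_x, sig_y))  # 0.5
```

The same function accepts the rows returned by digest() in the script above, provided both signatures were produced by the same method and the same generator.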

dopc commented 1 year ago

I want to use minhash_many, but as far as I understand, its result is not meaningful. In the code above, I used both functions only to show the difference between them.