ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.59k stars, 295 forks

Why can't I get the top k? #26

Closed Alisaincd closed 7 years ago

Alisaincd commented 7 years ago

Hi, I want to get the top k elements with MinHashLSHForest, but it failed. For example, I set 'k=3', but I got ('result: ', ['21', '28', '51', '1', '82', '3', '91', '69', '86', '85']), whose length is larger than 3. My demo is below:

from datasketch import MinHashLSHForest

def query_topk(l, query_doc, k):
    forest = MinHashLSHForest(num_perm=256)
    count = 0
    for i in l:
        # Use the position in the list as the key
        forest.add(str(count), i)
        count += 1
    # index() must be called before querying
    forest.index()
    result = forest.query(query_doc, k)
    return result

Here l is a list of MinHash objects and query_doc is a MinHash. Is there anything wrong?

By the way, must the input be a list of strings? What if my input is a vector? And another question: does this implementation only support text? If each of my inputs is a list of floats, e.g. [[1,2,3],[1.2,2.3,2.1]], will it still work?

Thanks for your patience.

Sincerely,

ekzhu commented 7 years ago

Thanks for raising the issue. I just fixed it in 1.2.1.

To answer your question: MinHash takes bytes as input, so as long as you can convert each object (e.g., an integer, string, float, or list) into bytes, it works with MinHash. For example:

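import struct
from datasketch import MinHash

minhash = MinHash(num_perm=256)  # 256, as in the forest above

# For a set of floats, e.g. {1.3, 123.4, 32.9, 3.1415926, ...}
minhash.update(struct.pack("f", 3.1415926))

# Every element of your input set is a list of floats,
# e.g. {[1.34, 1.3, 343.0, 123.9], [2.3, 23.2, 86.8], ...}
# The count in the format string ("4f") must match the list length.
minhash.update(struct.pack("4f", *[1.34, 1.3, 343.0, 123.9]))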
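
The last call hardcodes "4f", which only fits a list of exactly four floats, while the example set also contains a three-element list. A minimal sketch of one way to handle lists of any length follows. The helper name floats_to_bytes and the choice of little-endian 32-bit floats are assumptions for illustration, not part of datasketch:

```python
import struct

def floats_to_bytes(values):
    """Pack a sequence of floats into bytes as little-endian 32-bit floats."""
    # "<" fixes byte order and item size, so the same list yields the same
    # bytes on every platform; the count is taken from the list itself.
    return struct.pack(f"<{len(values)}f", *values)

# Hypothetical input: each element of the set is a list of floats,
# and the lists need not have the same length.
elements = [[1.34, 1.3, 343.0, 123.9], [2.3, 23.2, 86.8]]
encoded = [floats_to_bytes(v) for v in elements]

for v, b in zip(elements, encoded):
    print(len(v), "floats ->", len(b), "bytes")
```

Each item of encoded could then be passed to minhash.update(...). Two caveats: "f" stores single precision, so "d" may be a better fit if the full double precision matters, and the encoding is order-sensitive, so [1.0, 2.0] and [2.0, 1.0] count as different elements.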