ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop
Apache License 2.0

Unexpected multiset behaviour #17

Closed kevdur closed 2 years ago

kevdur commented 2 years ago

Firstly, thanks for this package and the datasketch one; they're both great.

I noticed some unexpected behaviour when using the all_pairs function with input data that aren't set-like:

from SetSimilaritySearch import all_pairs

sets = [["a", "b"], ["a", "a"]]
list(all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.1))
# [(1, 0, 1.0)]

sets = [["a", "a", "b"], ["a", "a"]]
list(all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.1))
# [(1, 0, 1.5)]

I assume that this package doesn't support multisets and that the output in such cases is undefined (setting the threshold to 0.75 in the second example leads to an empty result set, for instance). If that's the case, perhaps it would be a good idea to make this explicit in the documentation, and to mention that it's the user's responsibility to ensure that there are no duplicates in their input sets/lists.

In my case this simply means that I have to convert my lists to sets before passing them to all_pairs, but it did catch me off guard because that step wouldn't be necessary if I were applying MinHash LSH.
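For anyone hitting the same thing, here is a minimal sketch of that workaround. It uses only the standard library: `dedupe` and `jaccard` are hypothetical helper names, not part of SetSimilaritySearch, and `jaccard` is a plain reference implementation for checking what the similarity should be once duplicates are gone. It is not how the package computes it.

```python
def dedupe(collections):
    """Convert each input collection to a set, dropping repeated tokens."""
    return [set(c) for c in collections]


def jaccard(a, b):
    """Plain set Jaccard similarity: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


# The two inputs from the report above.
raw_examples = [
    [["a", "b"], ["a", "a"]],
    [["a", "a", "b"], ["a", "a"]],
]

results = []
for raw in raw_examples:
    clean = dedupe(raw)  # [{"a", "b"}, {"a"}] in both cases
    results.append(jaccard(clean[0], clean[1]))

print(results)  # [0.5, 0.5]
```

After deduplication both examples reduce to comparing {"a", "b"} with {"a"}, so the expected Jaccard similarity is 0.5 in each case. The deduplicated list (`clean`) is what I would pass to all_pairs in place of the raw lists.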

ekzhu commented 2 years ago

Thanks for pointing this out. I will update the documentation.