Firstly, thanks for this package and the datasketch one; they're both great.
I noticed some unexpected behaviour when using the `all_pairs` function with input data that aren't set-like:

I assume that this package doesn't support multisets and that the outputs in such cases are undefined (setting the threshold to 0.75 in the second example leads to an empty result set, for instance). If that's the case, perhaps it would be a good idea to make this explicit in the documentation, and to mention that it's the user's responsibility to ensure that there are no duplicates in their input sets/lists.
In my case this simply means that I have to convert my lists to sets before passing them to `all_pairs`, but it did catch me off guard because that step wouldn't be necessary if I were applying MinHash LSH.
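For anyone else who runs into this, here is a rough sketch of the workaround I mean. It uses only the standard library; the `dedupe` and `jaccard` helpers and the sample `records` are my own illustrative names, not part of this package, and I'm only assuming that the deduplicated lists are what `all_pairs` expects.

```python
def dedupe(records):
    """Drop repeated tokens from each record, keeping first-seen order."""
    return [list(dict.fromkeys(record)) for record in records]


def jaccard(a, b):
    """Plain set-based Jaccard similarity, ignoring duplicate tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical input: lists that contain duplicate tokens (multiset-like).
records = [
    ["a", "a", "b", "c"],
    ["a", "b", "b", "c"],
    ["x", "y", "y"],
]

# Deduplicate each record before handing the data to all_pairs.
clean = dedupe(records)
```

The idea is to pass `clean` rather than `records` to `all_pairs`, so that each input really is a set and the reported similarities match the set-based Jaccard definition.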