Open beniaminogreen opened 9 months ago
Hi @beniaminogreen, would it also make sense to support the Jaro-Winkler similarity? If so, maybe I could open a separate issue?
Working on adding the Hamming distance and associated documentation this week. I have to check, but I don't think anyone has discovered a locality-sensitive hash for the Jaro-Winkler similarity, so that metric would be difficult to implement. The closest I can find is this paper for the Levenshtein distance.
The LSH method described in the paper is more complex and, from a skim, its theoretical guarantees appear to assume that all of the strings being matched have the same length. I will have to read the paper more thoroughly to understand whether this is actually a limitation of the method (which might be an issue for some users), or whether the paper describes a way to join strings of different lengths.
Thanks for the info, it's already good to know that there's no "standard" way to combine J-W with LSH.
Now that the package is getting more mature, it would be nice to add support for other distance metrics (specifically, Hamming and cosine distances). These should be relatively easy to implement following these notes, and will also provide the opportunity to refine some of the code. In the future, I would like the hash families to implement an `lsh` trait, which would allow us to reuse some of the amplifying code.
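To make the trait idea concrete, here is a rough sketch of what it could look like. It is not the package's actual code: `Lsh`, `BitSampler`, `band_key`, `is_candidate`, and `bands_from_positions` are all hypothetical names. It uses bit sampling (hashing a string by the byte at one fixed position), which is the usual LSH family for Hamming distance, and writes the AND/OR amplification once against the trait so that another family (e.g. one for cosine distance) could reuse it.

```rust
/// Hypothetical trait: one draw from a locality-sensitive hash family.
trait Lsh {
    type Item: ?Sized;
    fn hash(&self, item: &Self::Item) -> u64;
}

/// Bit sampling for Hamming distance: the hash is the byte at a fixed position.
struct BitSampler {
    position: usize,
}

impl Lsh for BitSampler {
    type Item = [u8];

    fn hash(&self, item: &[u8]) -> u64 {
        // Hamming distance assumes equal-length inputs; a sentinel is
        // returned here if the position is out of range.
        item.get(self.position).map_or(u64::MAX, |&b| u64::from(b))
    }
}

/// AND-amplification: the key for a band is the tuple of all its hashes.
fn band_key<H: Lsh>(band: &[H], item: &H::Item) -> Vec<u64> {
    band.iter().map(|h| h.hash(item)).collect()
}

/// OR-amplification: two items are candidates if their keys agree
/// on at least one band.
fn is_candidate<H: Lsh>(bands: &[Vec<H>], a: &H::Item, b: &H::Item) -> bool {
    bands
        .iter()
        .any(|band| band_key(band, a) == band_key(band, b))
}

/// Groups sampled positions into bands of `band_size` hashers each.
/// `band_size` must be non-zero. In practice the positions would be
/// drawn at random; they are passed in here to keep the sketch deterministic.
fn bands_from_positions(positions: &[usize], band_size: usize) -> Vec<Vec<BitSampler>> {
    positions
        .chunks(band_size)
        .map(|chunk| chunk.iter().map(|&position| BitSampler { position }).collect())
        .collect()
}
```

`band_key` and `is_candidate` only rely on the `Lsh` bound, so a new hash family would only need to supply its own `hash` implementation.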