beniaminogreen / zoomerjoin

Superlatively-fast fuzzy-joins in R
https://beniamino.org/zoomerjoin/
GNU General Public License v3.0

[FR] Add support for cosine and hamming distances #97

Open beniaminogreen opened 9 months ago

beniaminogreen commented 9 months ago

Now that the package is getting more mature, it would be nice to add support for other distance metrics (specifically, Hamming and cosine distances). These should be relatively easy to implement following these notes, and will also provide the opportunity to refine some of the code. In the future, I would like the hash families to implement an LSH trait, which would allow us to reuse some of the amplification code.
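For readers following along, here is a minimal sketch of what a shared interface for hash families could look like, with the amplification code written once and reused. It is written in Python for brevity and is not zoomerjoin's actual code or API: `LSHFamily`, `HammingBitSampling`, `CosineHyperplane`, `amplified_keys`, and `candidate_pairs` are all hypothetical names. It assumes the two textbook families: bit sampling for Hamming distance (hash to the character at one random position) and random hyperplanes for cosine distance (hash to the sign of a random projection). Whether the notes mentioned above use the same constructions is an assumption.

```python
import random
from typing import Callable, Hashable, Protocol, Sequence


class LSHFamily(Protocol):
    """Hypothetical shared interface: draw one random hash function."""

    def sample(self, rng: random.Random) -> Callable[[Sequence], Hashable]: ...


class HammingBitSampling:
    """Bit sampling: hash an equal-length sequence to one random position."""

    def __init__(self, length: int) -> None:
        self.length = length

    def sample(self, rng: random.Random) -> Callable[[Sequence], Hashable]:
        i = rng.randrange(self.length)
        return lambda x: x[i]


class CosineHyperplane:
    """Random hyperplanes: hash a vector to the sign of a random projection."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def sample(self, rng: random.Random) -> Callable[[Sequence], Hashable]:
        plane = [rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return lambda x: sum(p * v for p, v in zip(plane, x)) >= 0.0


def amplified_keys(family: LSHFamily, rng: random.Random, bands: int, rows: int):
    """Build the amplified hash: AND within a band, OR across bands.

    Works for any family, which is the code reuse a common interface buys.
    """
    hashes = [[family.sample(rng) for _ in range(rows)] for _ in range(bands)]

    def keys(x):
        # One bucket key per band; the band index keeps bands separate.
        return [(b, tuple(h(x) for h in band)) for b, band in enumerate(hashes)]

    return keys


def candidate_pairs(left, right, keys):
    """Return index pairs (i, j) that share a bucket in at least one band."""
    buckets = {}
    for j, y in enumerate(right):
        for k in keys(y):
            buckets.setdefault(k, []).append(j)
    pairs = set()
    for i, x in enumerate(left):
        for k in keys(x):
            for j in buckets.get(k, []):
                pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    rng = random.Random(1)
    keys = amplified_keys(HammingBitSampling(length=5), rng, bands=4, rows=2)
    # "abcde" always matches itself and never matches "vwxyz",
    # because the two strings agree at no position.
    print(candidate_pairs(["abcde"], ["abcde", "vwxyz"], keys))
```

With `rows` hash functions combined per band and `bands` bands, two items whose single-hash collision probability is `p` become candidates with probability `1 - (1 - p^rows)^bands`. That calculation is the same for every family, so only `sample` needs to be written per distance metric. Candidates would still need to be checked against the exact distance afterwards.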

etiennebacher commented 9 months ago

Hi @beniaminogreen, would it also make sense to support the Jaro-Winkler similarity? If so, maybe I could open a separate issue?

beniaminogreen commented 8 months ago

Working on adding the Hamming distance + associated documentation this week. I have to check, but I don't think that anyone has discovered a locality-sensitive hash for the Jaro-Winkler similarity, so that specific distance metric would be difficult to implement. The closest I can find is this paper for the Levenshtein distance.

The LSH method described in the paper is more complex and, from skimming it, the theoretical guarantees appear to assume that the strings you have to match are all the same length. I will have to read the paper more thoroughly to understand whether this is actually a limitation of the method (which might be an issue for some users), or whether the paper describes a way to join strings of different lengths.

etiennebacher commented 8 months ago

Thanks for the info, it's already good to know that there's no "standard" way to combine Jaro-Winkler with LSH.