OlivierBinette / StringCompare

Efficient String Comparison Functions and Fuzzy String Matching

https://olivierbinette.github.io/StringCompare/

Checking for the null case should be done at the token bag level for the Jaccard similarity #24

Open OlivierBinette opened 2 years ago

OlivierBinette commented 2 years ago

The check for the null case (empty input) should be done at the token bag level rather than at the string level, since a non-empty string can still produce an empty token bag:

https://github.com/OlivierBinette/StringCompare/blob/be58f4c1c9c24bc2cef5d9bb81053fa7ea003792/stringcompare/distance/jaccard.py#L17

I would recommend refactoring jaccard.py as follows:

Have the jaccard() function take two token sets as arguments and compute their Jaccard similarity (the size of their intersection divided by the size of their union). Checking for empty token bags should be done here.
Have the compare() function deal with the tokenization and anything else (e.g. converting between distance and similarity).
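
A rough sketch of what this split could look like, to make the proposal concrete. This is not the current code in jaccard.py: the signatures, the `tokenizer` and `similarity` parameters, and the choice to return 1.0 when both token sets are empty are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set


def jaccard(tokens_s: Set[str], tokens_t: Set[str]) -> float:
    """Jaccard similarity |S & T| / |S | T| of two token sets."""
    # Null case, checked on the token sets rather than on the raw strings.
    # With both sets empty the union is empty and the ratio is undefined.
    # Returning 1.0 here is an assumption of this sketch.
    if not tokens_s and not tokens_t:
        return 1.0
    return len(tokens_s & tokens_t) / len(tokens_s | tokens_t)


def compare(
    s: str,
    t: str,
    tokenizer: Callable[[str], Iterable[str]] = str.split,  # hypothetical default
    similarity: bool = True,  # hypothetical flag
) -> float:
    """Tokenize both strings, then delegate to jaccard()."""
    sim = jaccard(set(tokenizer(s)), set(tokenizer(t)))
    # Anything beyond the set computation lives here, e.g. returning
    # a distance (1 - similarity) instead of a similarity.
    return sim if similarity else 1.0 - sim
```

With whitespace tokenization, compare("   ", "") reaches the null case even though the first string is not empty, which is the situation a string-level check would miss.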

OlivierBinette commented 2 years ago

Tagging @Garrett-Allen