Center-for-Research-Libraries / crl-serials-validator

Validate bibliographic and holdings data for shared print.
GNU General Public License v3.0

Drop dependency on fuzzywuzzy #18

Closed: nflorin closed this issue 2 years ago

nflorin commented 2 years ago

Basically, the name of the fuzzywuzzy library is problematic, and dropping the dependency is probably best in the long run.

nflorin commented 2 years ago

The validator uses fuzzywuzzy to perform fuzzy string matching on titles. The easiest way to replace it would be to switch the dependency to rapidfuzz, which is faster than fuzzywuzzy and also includes more string matching options. However, to use rapidfuzz on Windows you have to install the Visual C++ 2019 redistributable, which I think makes it a no-go. We don't want to introduce dependencies on Windows system installs.

fuzzywuzzy matches titles based on Levenshtein distances. The Levenshtein library does this (fuzzywuzzy uses it as an optional speedup), so the full title match can just be done with that. The tricky part will be the "partial fuzz", matching the cores of two title strings, so that we don't say that titles are mismatched just because one is called "Journal of Something" and the other is called "Journal of Something: the publication of the American Academy of Something". This can be done; it will just take a bit of work to get it exactly right.
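To make the "partial fuzz" idea concrete, here is a minimal sketch of one way it could work. It is not the validator's code: the function names and the threshold of 90 are made up for illustration, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the Levenshtein library so that it runs with no extra installs.

```python
from difflib import SequenceMatcher


def full_ratio(a: str, b: str) -> int:
    """Similarity of two whole strings on a 0-100 scale (difflib-based)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0
    # Slide a window the size of the shorter title across the longer title.
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, full_ratio(shorter, window))
        if best == 100:  # exact substring found, cannot do better
            break
    return best


def titles_match(a: str, b: str, threshold: int = 90) -> bool:
    """Hypothetical rule: accept if either the full or the partial score clears the threshold."""
    a, b = a.casefold().strip(), b.casefold().strip()
    return max(full_ratio(a, b), partial_ratio(a, b)) >= threshold


short_title = "Journal of Something"
long_title = ("Journal of Something: the publication of the "
              "American Academy of Something")
print(full_ratio(short_title, long_title))     # low, the lengths differ a lot
print(partial_ratio(short_title, long_title))  # 100, the short title is a substring
print(titles_match(short_title, long_title))
```

The sliding window is the simplest possible approach and is quadratic in the worst case, which should be fine for title-length strings. Getting it "exactly right" would mostly mean tuning the threshold and deciding how much normalization (punctuation, articles) to do first.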

nflorin commented 2 years ago

Ha, the author of fuzzywuzzy beat me to it. I thought to look at the project's GitHub page and discovered that it's now called TheFuzz. So I should just be able to swap out an import statement and all will be good.
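Roughly what the swap might look like, assuming the validator imports the `fuzz` module (I haven't checked every call site). The `title_similarity` wrapper and the `difflib` fallback are hypothetical and only there so the sketch runs even where `thefuzz` isn't installed.

```python
from difflib import SequenceMatcher

# Before: from fuzzywuzzy import fuzz
try:
    from thefuzz import fuzz  # same module name, new package name
except ImportError:
    fuzz = None  # thefuzz not installed; the sketch falls back to difflib


def title_similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two titles (hypothetical wrapper)."""
    if fuzz is not None:
        return fuzz.ratio(a, b)
    # Fallback for illustration only, not part of the proposed change.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(title_similarity("Journal of Something", "Journal of Something"))
```

Calls such as `fuzz.ratio` and `fuzz.partial_ratio` keep the same names under the new package, so the rest of the matching code shouldn't need to change.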