MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Cluster Strings #6

Closed MaartenGr closed 3 years ago

MaartenGr commented 3 years ago

Adds the ability to cluster a single list of strings (#5). First, match the list against itself:

from polyfuzz import PolyFuzz
one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
model = PolyFuzz("TF-IDF")
model.match(one_list, one_list)

You can then cluster the strings that the original strings were mapped to, using single linkage clustering:

model.group(link_min_similarity=0.75, group_all_strings=True)
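For readers unfamiliar with single linkage clustering, here is a minimal, self-contained sketch of the idea: two strings land in the same cluster whenever a chain of sufficiently similar pairs connects them. This is not PolyFuzz's implementation. The helper names `similarity` and `single_linkage_groups` are hypothetical, and `difflib.SequenceMatcher` is an assumed stand-in for the TF-IDF similarity, so the scores will differ from what the model above produces.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1] (assumption); PolyFuzz's TF-IDF scores differ.
    return SequenceMatcher(None, a, b).ratio()


def single_linkage_groups(strings, min_similarity=0.75):
    # Union-find: every string starts out as the root of its own cluster.
    parent = {s: s for s in strings}

    def find(s):
        # Walk up to the cluster root, shortening the path along the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Single linkage: merge two clusters as soon as any one pair of
    # their members reaches the similarity threshold.
    for a, b in combinations(strings, 2):
        if similarity(a, b) >= min_similarity:
            parent[find(a)] = find(b)

    # Collect the members of each cluster, keeping input order.
    groups = {}
    for s in strings:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
groups = single_linkage_groups(one_list, min_similarity=0.75)
print(groups)
```

With this stand-in metric, "apple", "apples", and "appl" end up in one cluster and the remaining strings stay on their own. Lowering `min_similarity` merges more aggressively, because a single similar pair is enough to join two clusters.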

The resulting clusters can be accessed with:

model.get_clusters()

MaartenGr commented 3 years ago

Removed Python 3.6 from the workflow for now: newer NumPy releases drop support for it, and the workflow keeps trying to install a newer release even though I explicitly specify an earlier version...