MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
733 stars, 67 forks

Cluster texts with similarity #5

Closed ariecattan closed 3 years ago

ariecattan commented 3 years ago

In addition to comparing two lists of strings, it would be great if we could cluster a single list of strings. This would be useful for many tasks.

MaartenGr commented 3 years ago

Fortunately, this is already implemented! First, match the list of strings against itself:

from polyfuzz import PolyFuzz
one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
model = PolyFuzz("TF-IDF")
model.match(one_list, one_list)

You can then cluster the strings that the original strings were mapped to, using single linkage clustering:

model.group(link_min_similarity=0.75)

The resulting clusters can be accessed with:

model.get_clusters()

This means that some strings will not be clustered, which can be an advantage if you want to be relatively certain that the clusters make sense.
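For intuition about what single linkage clustering with a minimum similarity does in general, here is a small library-independent sketch. It is not PolyFuzz's implementation: the `similarity` helper is a stand-in built on `difflib.SequenceMatcher` from the standard library, and `single_linkage_groups` is a hypothetical name. Its scores differ from TF-IDF scores, so the groups need not match what `model.get_clusters()` returns.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the TF-IDF score PolyFuzz computes."""
    return SequenceMatcher(None, a, b).ratio()


def single_linkage_groups(strings, min_similarity=0.75):
    """Group strings connected by a chain of pairs scoring >= min_similarity."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    parent = {s: s for s in unique}

    def find(s):
        # Follow parent pointers to the group representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Single linkage: one sufficiently similar pair is enough to merge two groups.
    for a, b in combinations(unique, 2):
        if similarity(a, b) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)

    # Strings without any sufficiently similar partner stay unclustered.
    return [sorted(members) for members in groups.values() if len(members) > 1]


one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
groups = single_linkage_groups(one_list, min_similarity=0.75)
print(groups)  # [['appl', 'apple', 'apples']]
```

With this stand-in measure, "recal", "house", and "similarity" have no partner above the threshold, so they are left out of every group.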

ariecattan commented 3 years ago

I'm more interested in clustering ALL the strings. In your example, I'd like to get the cluster ["apple", "apples", "appl"] and not only ["apple", "apples"].

This is not (yet) implemented, right?

MaartenGr commented 3 years ago

Fair enough! You are right, this clusters only a subset of the strings. A function for clustering a single list is currently not implemented.

I could change the group function so that you can specify whether the from_list should also be included when generating the clusters.

I'll look into that!

ariecattan commented 3 years ago

That would be great, thanks!!

MaartenGr commented 3 years ago

If you update to the newest version of PolyFuzz, you can cluster one list of strings as follows:

from polyfuzz import PolyFuzz
one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
model = PolyFuzz("TF-IDF")
model.match(one_list, one_list)

You can then cluster the strings that the original strings were mapped to, using single linkage clustering:

model.group(link_min_similarity=0.75, group_all_strings=True)

The resulting clusters can be accessed with:

model.get_clusters()

Any string that does not appear in model.get_clusters() was not clustered because no appropriate cluster could be found for it.
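To list the strings that ended up without a cluster, something like the following could work. This is a sketch under one assumption: that `model.get_clusters()` returns a dictionary mapping a cluster id to a list of member strings. The `unclustered` helper is a hypothetical name, and the `example_clusters` value is hand-written for illustration rather than real PolyFuzz output.

```python
def unclustered(strings, clusters):
    """Return the strings that do not appear in any cluster.

    `clusters` is assumed to look like {cluster_id: [member, ...]}.
    """
    clustered = {member for members in clusters.values() for member in members}
    return [s for s in strings if s not in clustered]


one_list = ["apple", "apples", "appl", "recal", "house", "similarity"]

# Hand-written stand-in for model.get_clusters(); the real result may differ.
example_clusters = {1: ["apple", "apples", "appl"]}

leftover = unclustered(one_list, example_clusters)
print(leftover)  # ['recal', 'house', 'similarity']
```

In practice you would pass `model.get_clusters()` in place of `example_clusters`, after checking that its shape matches the assumption above.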

ariecattan commented 3 years ago

It works perfectly. Your library is awesome, thanks a lot!!!

MaartenGr commented 3 years ago

No problem, glad I could be of help 🙂