MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Grouping issue when TFIDF.min_similarity < link_min_similarity #40

Open colasri opened 2 years ago

colasri commented 2 years ago

In the code below (with output in attached picture) I perform a simple TFIDF matching of ["apple", "apples", "appl", "recal", "happy"].

The initial min_similarity is set to 0.2. The similarity of happy and appl is 0.24.

When grouping with a link_min_similarity of 0.5, happy should not end up in the apples group, since its similarity to appl (0.24) is below that threshold. Yet in the output of .get_matches(), happy is assigned to the apples group.

However, it does not seem to be part of the cluster itself.

[Attached image: grouping]

Plain text code:

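from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

from_list = ["apple", "apples", "appl", "recal", "happy"]
# Keep every match with a TF-IDF similarity of at least 0.2
matcher = TFIDF(min_similarity=0.2)
# No to_list is given, so the list is matched against itself
model = PolyFuzz(matcher).match(from_list)
cm = model.cluster_mappings
# Group the strings, linking only at a similarity of 0.5 or higher
model.group(link_min_similarity=0.5, group_all_strings=True)
# happy shows up in the apples group despite a similarity of only 0.24
print(model.get_matches())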

MaartenGr commented 2 years ago

I am not entirely sure, but there seems to be an issue with the group_all_strings parameter combined with link_min_similarity. Most likely, (appl, apple) gets into the apples cluster, and (happy, appl) then gets into the same cluster because it shares appl. I'll have to dig a little deeper to figure this out, but I'll make sure a fix gets released in the next version!
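
To make the suspected mechanism concrete, here is a rough pure-Python sketch of how a row could inherit a group through the string it was matched to, even though its own similarity is below the linking threshold. This is not PolyFuzz's actual code: the helper `build_clusters`, the `naive` and `strict` labelling rules, and every similarity value except happy -> appl (0.24) are assumptions made up for illustration.

```python
# Illustrative sketch only; this is NOT PolyFuzz's implementation.
# Every similarity except happy -> appl (0.24) is invented for the example.
LINK_MIN_SIMILARITY = 0.5

# (from_string, best_match, similarity), as a matcher with
# min_similarity=0.2 might plausibly return them.
matches = [
    ("apple", "apples", 0.90),
    ("apples", "apple", 0.90),
    ("appl", "apple", 0.80),
    ("happy", "appl", 0.24),
    ("recal", None, 0.0),
]


def build_clusters(pairs, threshold):
    """Hypothetical helper: union-find over links at or above the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if b is not None and sim >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for s in list(parent):
        clusters.setdefault(find(s), set()).add(s)
    return [c for c in clusters.values() if len(c) > 1]


clusters = build_clusters(matches, LINK_MIN_SIMILARITY)
member_to_cluster = {s: frozenset(c) for c in clusters for s in c}

# Naive labelling: a row takes the group of the string it was matched to,
# without checking the row's own similarity.
naive = {frm: member_to_cluster.get(to) for frm, to, _ in matches}

# Stricter labelling: the row's own similarity must also clear the threshold.
strict = {
    frm: member_to_cluster.get(to) if sim >= LINK_MIN_SIMILARITY else None
    for frm, to, sim in matches
}

print("clusters:", clusters)
print("naive  group for happy:", naive["happy"])
print("strict group for happy:", strict["happy"])
```

In this sketch, happy never joins a cluster, yet the naive rule still labels its row with the apples group because appl belongs to it. That matches the symptom in the report, but whether it is the real cause would need to be confirmed against the library's grouping code.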