Grouping issue when TFIDF.min_similarity < link_min_similarity

MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.

MIT License

725 stars, 68 forks

In the code below (with the output in the attached picture), I perform a simple TFIDF matching of ["apple", "apples", "appl", "recal", "happy"].

The initial min_similarity is set to 0.2. The similarity of happy and appl is 0.24.

When grouping with a link_min_similarity of 0.5, happy should not belong to the apples group, because its similarity of 0.24 is below that threshold. Yet in the output of .get_matches(), happy is assigned to the apples group.

It does not appear to be in the cluster itself, though.
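To make the expectation concrete, here is a minimal sketch of the rule I would expect grouping to follow. It is not PolyFuzz's implementation; the helper name `expected_group` and its signature are made up for illustration, and only the 0.24 and 0.5 values come from this report.

```python
from typing import Optional


def expected_group(similarity: float,
                   candidate_group: str,
                   link_min_similarity: float) -> Optional[str]:
    """Hypothetical helper: keep the candidate group only if the
    similarity reaches the grouping threshold, otherwise return None."""
    if similarity >= link_min_similarity:
        return candidate_group
    return None


# Values from this issue: similarity(happy, appl) = 0.24, threshold = 0.5
print(expected_group(0.24, "apples", link_min_similarity=0.5))  # None
```

Under this rule, a string that passes min_similarity (0.2) but not link_min_similarity (0.5) stays out of the group.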

[Attached image: grouping]

Plain text code:

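from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

from_list = ["apple", "apples", "appl", "recal", "happy"]

# Match the list against itself; similarities below 0.2 are dropped
matcher = TFIDF(min_similarity=0.2)
model = PolyFuzz(matcher).match(from_list)

# Group with a stricter threshold than the one used for matching
model.group(link_min_similarity=0.5, group_all_strings=True)
print(model.get_matches())

# Read the cluster mappings after grouping so they reflect its result
cm = model.cluster_mappings
print(cm)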

Grouping issue when TFIDF.min_similarity < link_min_similarity #40