MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

TFIDF min_similarity not applied #49

Open philkoch opened 1 year ago

philkoch commented 1 year ago

When using the TFIDF model, the min_similarity parameter does not seem to be applied to the results.

Minimal Example that reproduces the problem (polyfuzz 0.4.0):

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

if __name__ == "__main__":
    token_list = [
        "Stoltenbergs",
        "Ansage",
        "Putin",
        "Nato",
        "Drohungen",
        "Russlands",
        "Nato",
        "Unterstützung",
        "Ukraine",
        "Stoltenberg",
        "Putin",
        "Nato",
    ]

    matcher = TFIDF(n_gram_range=(3, 3), min_similarity=0.9)
    model = PolyFuzz(matcher)
    model.match(token_list)
    model.group()
    matches = model.get_matches()
    print(matches)

Running the code generates the output below, but rows 4 and 7 should have a Similarity score of 0, if I understand the documentation correctly:

The minimum similarity between strings, otherwise return 0 similarity

I would expect the rows with a Similarity of < 0.9 to have a Similarity of 0 and a To value of None.

Output:

             From             To  Similarity          Group
0    Stoltenbergs    Stoltenberg       0.932   Stoltenbergs
1          Ansage           None       0.000           None
2           Putin          Putin       1.000          Putin
3            Nato           Nato       1.000           Nato
4       Drohungen  Unterstützung       0.091  Unterstützung
5       Russlands           None       0.000           None
6            Nato           Nato       1.000           Nato
7   Unterstützung      Drohungen       0.091      Drohungen
8         Ukraine           None       0.000           None
9     Stoltenberg   Stoltenbergs       0.932   Stoltenbergs
10          Putin          Putin       1.000          Putin
11           Nato           Nato       1.000           Nato

In case I'm using the library wrong, how would I be able to get only results with a similarity higher than 0.9?

MaartenGr commented 1 year ago

You are using the library correctly, but it seems that min_similarity was not implemented properly for all cosine similarity backends. I will make sure this gets fixed in a future release. For now, if you want to use this feature, you can get it with:

pip install polyfuzz[fast]
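
If installing the extra is not an option, the threshold can also be applied after matching. The following is only a minimal sketch of that idea, not the library's own fix: `apply_min_similarity` is a hypothetical helper (it is not part of PolyFuzz), and it assumes the matches are available as plain records with the From, To and Similarity columns shown in the output above. Any other column, such as Group, is left untouched.

```python
def apply_min_similarity(rows, min_similarity):
    """Hypothetical helper: blank out matches below the threshold.

    Rows whose Similarity is below min_similarity get To=None and
    Similarity=0.0, which is the behaviour the documentation describes.
    """
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        if row["Similarity"] < min_similarity:
            row["To"] = None
            row["Similarity"] = 0.0
        cleaned.append(row)
    return cleaned


# Three rows copied from the output reported in this issue.
rows = [
    {"From": "Stoltenbergs", "To": "Stoltenberg", "Similarity": 0.932},
    {"From": "Drohungen", "To": "Unterstützung", "Similarity": 0.091},
    {"From": "Putin", "To": "Putin", "Similarity": 1.000},
]

filtered = apply_min_similarity(rows, min_similarity=0.9)
for row in filtered:
    print(row)
```

Assuming get_matches() returns a pandas DataFrame, as in the example above, the records could be obtained with matches.to_dict("records") and turned back into a table with pandas.DataFrame(filtered).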

philkoch commented 1 year ago

I will try that, thanks for the quick response!

nitindabadghav commented 2 months ago

Hello Maarten, whichever model I use with PolyFuzz, the model parameters are never applied. Is there any workaround for this?

Thanks, Nitin

MaartenGr commented 2 months ago

@nitindabadghav Could you provide a bit more information? What version do you use? Can you share your code? Have you tried the answer I provided above? Etc.