MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Unexpected output #29

Closed. jpforny closed this issue 2 years ago.

jpforny commented 2 years ago

Hi, @MaartenGr

Thank you for this great library! It's really well engineered.

Today I noticed some weird behavior. It seems that in some cases, the model returns only inexact matches. I was able to reproduce it with the following code:

from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
to_list = ["apple", "apples"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
model.get_matches()

Output:

From    To      Similarity
apple   apples  0.754
apples  apple   0.754

If a third element is added to from_list or to_list, it works as expected.

from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
to_list = ["apple", "apples", "test"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
model.get_matches()

Output:

From    To      Similarity
apple   apple   1.0
apples  apples  1.0

Am I missing something?

MaartenGr commented 2 years ago

This happens when you pass two identical lists. If from_list and to_list are the same, there is no need to apply a distance metric to find the best match, since every string would simply map to itself. We therefore ignore that self-mapping and check all other candidates. That is why the first example (where to_list is identical to from_list) gives different mappings from the second example (where to_list differs from from_list).
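
To make the idea concrete, here is a minimal sketch of that behavior. It is not PolyFuzz's actual implementation: the function name best_matches is hypothetical, the similarity comes from the standard library's difflib.SequenceMatcher rather than TF-IDF, and it assumes the self-mapping is skipped by position whenever the two lists compare equal. The scores will therefore differ from the ones reported above; only the resulting pairings are meant to line up.

```python
from difflib import SequenceMatcher


def best_matches(from_list, to_list):
    """Return a (from, to, similarity) tuple for each entry in from_list.

    Assumption for illustration only: when both lists are identical,
    a string is never matched to its own position. This mimics the
    behavior described in this thread, not PolyFuzz's real code.
    """
    skip_self = from_list == to_list
    results = []
    for i, source in enumerate(from_list):
        best_target, best_score = None, -1.0
        for j, target in enumerate(to_list):
            if skip_self and i == j:
                continue  # ignore the trivial self-mapping
            # Stand-in similarity measure (not TF-IDF).
            score = SequenceMatcher(None, source, target).ratio()
            if score > best_score:
                best_target, best_score = target, score
        results.append((source, best_target, round(best_score, 3)))
    return results


# Identical lists: each string is paired with its closest *other* string.
same = best_matches(["apple", "apples"], ["apple", "apples"])

# Different lists: exact matches are allowed again.
different = best_matches(["apple", "apples"], ["apple", "apples", "test"])

print(same)
print(different)
```

With identical lists, "apple" is paired with "apples" and vice versa. Once to_list differs, each string is paired with its exact counterpart at a similarity of 1.0.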

jpforny commented 2 years ago

Understood.

Thank you!