MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars, 68 forks

For match, order of input lists matters (why?) #52

Closed: matrs closed this issue 1 year ago

matrs commented 1 year ago

Hello,
The problem I have is that, when comparing two lists of words, PolyFuzz returns a different result depending on the order of the input lists:

In [6]: PolyFuzz('EditDistance').match(['hola'], ['holas', 'hola']).matches
Out[6]: 
{'EditDistance':    From    To  Similarity
 0  hola  hola         1.0}

In [7]: PolyFuzz('EditDistance').match(['holas', 'hola'],['hola']).matches
Out[7]: 
{'EditDistance':     From    To  Similarity
 0  holas  hola    0.888889
 1   hola  hola    1.000000}

In the first case, I expected the hola from the first input to match both words in the second input, but PolyFuzz only returns the best match for each word in the first list against the second. As a consequence, the order of the input lists matters, which was really unexpected for me. I don't know much about text processing in general, so maybe this doesn't make much sense, but for my application it really matters that I get the same output regardless of the order of the input lists.

PS: I only realized this after a couple of weeks of using this package and writing a fair amount of code, so the idea of switching to another package isn't very appealing, but maybe someone knows of a package for edit distance that behaves as I described.

PS2: I just realized that for polyfuzz.models.TFIDF I can set top_n, but that isn't available for 'EditDistance'.

MaartenGr commented 1 year ago

What happens here is that you have an input list called from_list, and you want each of its entries matched to one of the entries in to_list. In other words, you have this pipeline:

from_list --> to_list

So when you have ["hola"], ["holas", "hola"] this is effectively doing the following:

["hola"] --> ["holas", "hola"]

As a result, it tries to match "hola" to one of the two words, and you get only one mapping. If you have ["holas", "hola"], ["hola"] instead, then it performs the following:

["holas", "hola"] --> ["hola"]

and both words get matched to the single "hola".
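The one-directional behaviour described above can be sketched in plain Python. This is only an illustration, not PolyFuzz's actual implementation: `similarity` and `best_matches` are hypothetical helpers, and the standard library's `difflib.SequenceMatcher` stands in for the edit-distance scorer.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the scorer PolyFuzz uses."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(from_list, to_list):
    """For each entry in from_list, keep only its single best match in to_list."""
    rows = []
    for source in from_list:
        # Pick the candidate in to_list with the highest similarity to source.
        target = max(to_list, key=lambda candidate: similarity(source, candidate))
        rows.append((source, target, similarity(source, target)))
    return rows


# One row per entry in from_list, so swapping the arguments changes the row count.
forward = best_matches(["hola"], ["holas", "hola"])
backward = best_matches(["holas", "hola"], ["hola"])
print(len(forward), len(backward))
```

Because the loop runs over from_list only, the number of rows always equals the length of from_list, which is why the two calls in the original report return one and two rows respectively.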

If you want them all matched to one another, you can also input a single list with all of your words instead:

PolyFuzz('EditDistance').match(['holas', 'hola']).matches

matrs commented 1 year ago

Thanks for your answer. The thing is that I always have two lists, and I don't need to compare members of the same list. I ended up writing a nested for loop and using fuzz.WRatio from the RapidFuzz library (I saw that it is used in PolyFuzz's EditDistance). It takes a fraction of a second for 80000 comparisons of short phrases (1-4 words). The top_n setting available for other models worked for me too, because I usually needed the top 3-4 matches, which I could then filter by a threshold.
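A minimal sketch of that workaround, keeping the top few matches per entry and then filtering by a threshold, might look like the following. It is not the code from this thread: `top_n_matches` and `ratio` are hypothetical names, and `difflib.SequenceMatcher` is used as the default scorer so the example runs without extra dependencies. A scorer such as RapidFuzz's `fuzz.WRatio` could presumably be passed in instead, but note that it scores on a 0-100 scale, so the threshold would need adjusting.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def ratio(a: str, b: str) -> float:
    """Default stand-in scorer in [0, 1]; swap in another scorer if preferred."""
    return SequenceMatcher(None, a, b).ratio()


def top_n_matches(from_list, to_list, top_n=3, threshold=0.8, scorer=ratio):
    """Return, for each entry in from_list, up to top_n matches scoring >= threshold."""
    matches = {}
    for source in from_list:
        # Score every candidate (the nested loop), then keep the top_n best.
        scored = ((scorer(source, target), target) for target in to_list)
        best = nlargest(top_n, scored)
        # Drop anything below the threshold after the top_n cut.
        matches[source] = [(target, score) for score, target in best if score >= threshold]
    return matches


result = top_n_matches(["hola"], ["holas", "hola", "adios"])
print(result)
```

Here "hola" keeps both "hola" and "holas" as matches, while "adios" falls below the threshold and is dropped.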