Closed by matrs 1 year ago
What happens here is that you have an input list called from_list, and each of its entries is matched to one of the entries in to_list. In other words, you will have this pipeline:
from_list --> to_list
So when you have ["hola"], ["holas", "hola"] this is effectively doing the following:
["hola"] --> ["hola", "holas"]
As a result, it tries to match ["hola"] to one of the two other words and you will get only one mapping. If you have ["holas", "hola"], ["hola"] instead, then it tries to perform the following:
["holas", "hola"] --> ["hola"]
and both words will get matched to the single ["hola"].
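The asymmetry described above can be reproduced without PolyFuzz. The following is a minimal sketch that uses difflib.SequenceMatcher from the Python standard library as a stand-in for the edit-distance scorer. best_matches and similarity are hypothetical helpers, not part of PolyFuzz, and the scores will not be the ones PolyFuzz reports.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for PolyFuzz's edit-distance score."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(from_list, to_list):
    """Map every entry of from_list to its single best match in to_list."""
    return {
        word: max(to_list, key=lambda candidate: similarity(word, candidate))
        for word in from_list
    }


# One entry in from_list -> one mapping, however long to_list is.
print(best_matches(["hola"], ["holas", "hola"]))

# Two entries in from_list -> two mappings, both onto the only candidate.
print(best_matches(["holas", "hola"], ["hola"]))
```

The number of mappings always equals the length of from_list, which is why swapping the two arguments changes the result.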
If you want them all matched to one another, you can also input a single list with all of your words instead:
PolyFuzz('EditDistance').match(['holas', 'hola']).matches
Thanks for your answer. The thing is that I always have two lists and I don't need to compare members of the same list. I ended up writing a nested for loop and using fuzz.WRatio from your RapidFuzz library (I saw that it is used in PolyFuzz's EditDistance). It takes a fraction of a second for 80000 comparisons of short phrases (1-4 words).
The top_n setting available for other metrics worked for me too, because I usually needed the top 3-4 matches and could then filter by a threshold.
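A nested loop of the kind described above might look roughly like the sketch below. It is not the code from this thread: top_matches, its top_n default of 3, and the threshold of 80 are illustrative assumptions. It uses fuzz.WRatio when RapidFuzz is installed and otherwise falls back to a standard-library scorer on the same 0-100 scale, so exact scores depend on which scorer is active.

```python
try:
    # RapidFuzz is optional here; WRatio returns a score from 0 to 100.
    from rapidfuzz.fuzz import WRatio as score
except ImportError:
    from difflib import SequenceMatcher

    def score(a, b):
        """Standard-library fallback on the same 0-100 scale."""
        return 100 * SequenceMatcher(None, a, b).ratio()


def top_matches(from_list, to_list, top_n=3, threshold=80):
    """Return up to top_n (candidate, score) pairs per from_list entry.

    Only candidates scoring at least `threshold` are kept, best first.
    """
    results = {}
    for word in from_list:
        # Inner loop: score this word against every candidate in to_list.
        scored = [(candidate, score(word, candidate)) for candidate in to_list]
        kept = [pair for pair in scored if pair[1] >= threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        results[word] = kept[:top_n]
    return results


matches = top_matches(["hola"], ["holas", "hola", "adios"])
for word, candidates in matches.items():
    print(word, candidates)
```

Because every pair is scored, each entry of from_list can keep several candidates from to_list instead of only the single best one.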
Hello,
The problem I have is that when comparing two lists of words, PolyFuzz returns a different result depending on the order of these input lists:
So in the first case, I expected the hola from the first input to match both of the words in the second input, but it actually only returns the best match of each word in the first list against the second. As a consequence, the order of the input lists matters, which was really unexpected for me. I don't know much about the world of text processing in general, so maybe this doesn't make much sense, but for the application I'm using PolyFuzz for, getting the same output regardless of the order of the input lists really matters.
PS: I only realized this after a couple of weeks of using this package and writing a fair amount of code, so the idea of using another package isn't very appealing, but maybe someone knows of a package for edit distance that behaves as I described.
PS2: I just realized that for polyfuzz.models.TFIDF I can set top_n, but that isn't available for 'EditDistance'.