MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars 68 forks source link

I don't understand what is returned #73

Open raffaem opened 7 months ago

raffaem commented 7 months ago

The docs provide the following example:

from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF").match(from_list, to_list)

which returns

>>> model.get_matches()
         From      To    Similarity
0       apple   apple    1.000000
1      apples  apples    1.000000
2        appl   apple    0.783751
3       recal    None    0.000000
4       house   mouse    0.587927
5  similarity    None    0.000000

I don't understand what is returned. If polyfuzz is doing pairwise comparion from from list to to list, it should return len(from)*len(to) rows. So 18 results in this case.

In the example, apple->apples, apple->mouse, apples->apple, apples->mouse and so on are missing.

MaartenGr commented 7 months ago

PolyFuzz extracts the most similar examples from the from_list and attempts to find the best matches. If you want to extract more matches, use the top_n parameter in the .match function.

raffaem commented 7 months ago

Can we take a step back?

How does polyfuzz knows what are the "most similar examples from the from_list"?

It must calculate each pairwise distance, so m*n distances, where m=len(from_list) and n=len(to_list), and take the most similar examples, right?

Because the docs say that top_n is "only implemented for polyfuzz.models.TFIDF and polyfuzz.models.Embeddings".

So I don't understand with EditDistance how does it know what are the most similar examples without calculating each pairwise distance.

MaartenGr commented 7 months ago

I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py

raffaem commented 7 months ago

I continue to do NOT understand.

Sorting a list of floating point numbers and returning the biggest N ones is less computationally expensive than returning all of them?

Dec 6, 2023 7:31:27 PM Maarten Grootendorst @.***>:

I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py

— Reply to this email directly, view it on GitHub[https://github.com/MaartenGr/PolyFuzz/issues/73#issuecomment-1843455458], or unsubscribe[https://github.com/notifications/unsubscribe-auth/ANBZZ5S5SBNZ4SLM3FMDSMLYIC2XPAVCNFSM6AAAAABAJVAZSKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBTGQ2TKNBVHA]. You are receiving this because you authored the thread.[Tracking image][https://github.com/notifications/beacon/ANBZZ5RC36VM35S5UC5XU2TYIC2XPA5CNFSM6AAAAABAJVAZSKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTN4DS6E.gif]

MaartenGr commented 7 months ago

I think the source code above would be a good start. Notice that there are several ways to calculate pairwise distances in that method. Doing so for a m x n matrix can be computationally difficult. Instead, getting only the best can be more feasible depending on the method you use. The link above might help you understand.