Open raffaem opened 7 months ago
PolyFuzz extracts the most similar examples from the from_list
and attempts to find the best matches. If you want to extract more matches, use the top_n
parameter in the .match
function.
Can we take a step back?
How does polyfuzz knows what are the "most similar examples from the from_list
"?
It must calculate each pairwise distance, so m*n distances, where m=len(from_list) and n=len(to_list), and take the most similar examples, right?
Because the docs say that top_n
is "only implemented for polyfuzz.models.TFIDF and polyfuzz.models.Embeddings".
So I don't understand with EditDistance
how does it know what are the most similar examples without calculating each pairwise distance.
I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py
I continue to do NOT understand.
Sorting a list of floating point numbers and returning the biggest N ones is less computationally expensive than returning all of them?
Dec 6, 2023 7:31:27 PM Maarten Grootendorst @.***>:
I didn't say that we are not calculating pairwise distances. What is happening here is that after calculating the distances, extracting the best or top n matching values can computationally be expensive. For that reason, there are several options implemented that allow you to select either the best or the top n matches. You can find more about that in the source code here: https://github.com/MaartenGr/PolyFuzz/blob/master/polyfuzz/models/_utils.py
— Reply to this email directly, view it on GitHub[https://github.com/MaartenGr/PolyFuzz/issues/73#issuecomment-1843455458], or unsubscribe[https://github.com/notifications/unsubscribe-auth/ANBZZ5S5SBNZ4SLM3FMDSMLYIC2XPAVCNFSM6AAAAABAJVAZSKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBTGQ2TKNBVHA]. You are receiving this because you authored the thread.[Tracking image][https://github.com/notifications/beacon/ANBZZ5RC36VM35S5UC5XU2TYIC2XPA5CNFSM6AAAAABAJVAZSKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTN4DS6E.gif]
I think the source code above would be a good start. Notice that there are several ways to calculate pairwise distances in that method. Doing so for a m x n matrix can be computationally difficult. Instead, getting only the best can be more feasible depending on the method you use. The link above might help you understand.
The docs provide the following example:
which returns
I don't understand what is returned. If polyfuzz is doing pairwise comparion from
from
list toto
list, it should return len(from)*len(to) rows. So 18 results in this case.In the example, apple->apples, apple->mouse, apples->apple, apples->mouse and so on are missing.