best N matches - Githubissues

tap222 commented 3 years ago

When matching from_list against to_list, I can't find any parameter that returns the best N matches for each row.

MaartenGr commented 3 years ago

You are correct! There is currently no option to select the best n matches. This has to do with the cosine similarity measure that is typically used when comparing string embeddings. I am using one of three techniques for that:

sklearn.metrics.pairwise.cosine_similarity
sklearn.neighbors.NearestNeighbors
sparse_dot_topn.awesome_cossim_topn

With scikit-learn's cosine_similarity function we can compare all strings, so returning the top n should be doable. However, kNN and sparse_dot_topn let you select only the single best match, which significantly reduces memory use because the entire similarity matrix never has to be held in memory. This is why I prefer to keep it at the single best match.
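To illustrate the cosine_similarity route described above, here is a minimal sketch that builds the full similarity matrix and keeps the n highest-scoring columns per row. It is not PolyFuzz's implementation: the helper name top_n_matches, the character-trigram TF-IDF vectorizer, and the sample strings are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_n_matches(from_list, to_list, n=2):
    """Hypothetical helper: the n most similar to_list strings per from_list string."""
    # Character-trigram TF-IDF is an assumption chosen for this sketch.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    vectorizer.fit(from_list + to_list)
    from_vecs = vectorizer.transform(from_list)
    to_vecs = vectorizer.transform(to_list)

    # Dense matrix of shape (len(from_list), len(to_list)).
    # Holding this whole matrix is the memory cost discussed in the thread.
    sim = cosine_similarity(from_vecs, to_vecs)

    n = min(n, len(to_list))
    # Sort each row by descending similarity and keep the first n columns.
    top_idx = np.argsort(-sim, axis=1)[:, :n]
    return [
        [(to_list[j], float(sim[i, j])) for j in row]
        for i, row in enumerate(top_idx)
    ]


from_list = ["apple", "apples", "house"]
to_list = ["apple", "apples", "mouse", "houses"]
matches = top_n_matches(from_list, to_list, n=2)
for source, row in zip(from_list, matches):
    print(source, row)
```

Each entry of matches is a list of (candidate, similarity) pairs ordered from best to worst. Because the full matrix is materialized, this approach is only practical for small to medium lists.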

Moreover, some similarity measures, such as RapidFuzz, give back only a single best match, which further complicates things, as does the way the BaseMatcher is constructed.

tap222 commented 3 years ago

You are very much correct! I have been playing around with sparse_dot_topn.awesome_cossim_topn to return the best n matches, but so far without success. As you mentioned, it is very much doable with scikit-learn's cosine_similarity, but when I compare larger datasets I run into memory errors or it takes a very long time to produce the results.
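One possible way around the memory problem with larger datasets is to ask sklearn.neighbors.NearestNeighbors for n neighbors instead of one, so only the n best candidates per row are returned rather than a full similarity matrix. This is a sketch under assumptions and not the solution used in this thread: the helper name knn_top_n, the trigram TF-IDF vectorizer, and the sample strings are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def knn_top_n(from_list, to_list, n=2):
    """Hypothetical helper: n nearest to_list strings per from_list string."""
    # Character-trigram TF-IDF is an assumption chosen for this sketch.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    vectorizer.fit(from_list + to_list)
    to_vecs = vectorizer.transform(to_list)
    from_vecs = vectorizer.transform(from_list)

    n = min(n, len(to_list))
    # Cosine distance = 1 - cosine similarity; results come back nearest first.
    knn = NearestNeighbors(n_neighbors=n, metric="cosine").fit(to_vecs)
    distances, indices = knn.kneighbors(from_vecs)

    # Only two (len(from_list), n) arrays are returned, not the full matrix.
    return [
        [(to_list[j], 1.0 - float(d)) for j, d in zip(idx_row, dist_row)]
        for idx_row, dist_row in zip(indices, distances)
    ]


from_list = ["apple", "apples", "house"]
to_list = ["apple", "apples", "mouse", "houses"]
knn_matches = knn_top_n(from_list, to_list, n=2)
for source, row in zip(from_list, knn_matches):
    print(source, row)
```

The output has the same shape as a top-n slice of the similarity matrix, but its size grows with n rather than with the length of to_list. Strings shorter than three characters would produce empty trigram vectors, so a real implementation would need to handle that case.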

tap222 commented 3 years ago

I am now able to find the best matches efficiently.

MaartenGr / PolyFuzz

best N matches #11