MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

results of get_matches() are not sorted by similarity score for all the values #50

Open ashutosh486 opened 1 year ago

ashutosh486 commented 1 year ago

Hi,

I was running the PolyFuzz TF-IDF model to get matches, but in a few rows of the result the top_n matches were not sorted by similarity score.

tfidf_model = PolyFuzz(tfidf_matcher)
tfidf_model.match(from_list, to_list)
tfidf_model.get_matches()
eg:

| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 3 IN 1 LAVENDER & CAMOMILE | 2 IN 1 LAVENDER & CAMOMILE | 0.938 | 3 IN 1 LAVENDER & CAMOMILE | 1 | 3 IN 1 LAVENDER | 0.771 | 3 IN 1 LAVENDER & CHAMOMILE | 0.831 | LAVENDER CAMOMILE | 0.764 |

MaartenGr commented 1 year ago

Could you create a minimal example out of what you show here, with values for from_list and to_list and the value of top_n that you selected? That would make it a bit easier for me to figure out what exactly is going on.

ashutosh486 commented 1 year ago

Please find below a minimal example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]

# Same as test_tolist_1, without "3 IN 1 LAVENDER & CHAMOMILE"
test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]

test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf")
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_1)["TF-IDF"]
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_2)["TF-IDF"]
Output for test_tolist_1:

| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | 3 IN 1 LAVENDER | 0.733 | LAVENDER CAMOMILE | 0.81 | 2 IN 1 LAVENDER & CAMOMILE | 0.887 | 3 IN 1 LAVENDER & CHAMOMILE | 0.696 |

Output for test_tolist_2:

| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | LAVENDER CAMOMILE | 0.797 | 2 IN 1 LAVENDER & CAMOMILE | 0.893 | 3 IN 1 LAVENDER | 0.747 |

Problems:

  1. The matches are not sorted by similarity score.
  2. Removing or adding text in the to_list changes the similarity scores.

Just to add to this: I have commented out the following line of code, as I had asked about in the previous issue: https://github.com/MaartenGr/PolyFuzz/issues/48 https://github.com/MaartenGr/PolyFuzz/blob/b26638ff051a2d0d7c100619657b5703e47c9365/polyfuzz/models/_tfidf.py#L130

MaartenGr commented 1 year ago

> Similarity score is not sorted

Did you install PolyFuzz through pip install polyfuzz[fast]? If so, I believe the cause is that sparse_dot_topn does not return the similarities in order. I would have to check what exactly goes on there.
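
As a stopgap, the result can be re-sorted after the fact. The following is a minimal sketch, assuming the column layout shown in the outputs above (To, Similarity, To_2, Similarity_2, ...). The helper sort_row_by_similarity is hypothetical and not part of PolyFuzz; it works on one row given as a plain dict.

```python
import math


def sort_row_by_similarity(row: dict) -> dict:
    """Reorder one row's (To_k, Similarity_k) pairs by descending similarity."""
    # Column pairs in rank order: ("To", "Similarity"), ("To_2", "Similarity_2"), ...
    pairs = [("To", "Similarity")]
    k = 2
    while f"To_{k}" in row:
        pairs.append((f"To_{k}", f"Similarity_{k}"))
        k += 1

    def is_missing(score):
        return score is None or (isinstance(score, float) and math.isnan(score))

    candidates = [(row[to], row[sim]) for to, sim in pairs]
    # Highest similarity first; pairs without a score go last.
    candidates.sort(key=lambda c: (is_missing(c[1]), 0.0 if is_missing(c[1]) else -c[1]))

    result = {"From": row["From"]}
    for (to, sim), (match, score) in zip(pairs, candidates):
        result[to] = match
        result[sim] = score
    return result


# The unsorted row reported at the top of this issue.
row = {
    "From": "3 IN 1 LAVENDER & CAMOMILE",
    "To": "2 IN 1 LAVENDER & CAMOMILE", "Similarity": 0.938,
    "To_2": "3 IN 1 LAVENDER & CAMOMILE", "Similarity_2": 1.0,
    "To_3": "3 IN 1 LAVENDER", "Similarity_3": 0.771,
    "To_4": "3 IN 1 LAVENDER & CHAMOMILE", "Similarity_4": 0.831,
    "To_5": "LAVENDER CAMOMILE", "Similarity_5": 0.764,
}
sorted_row = sort_row_by_similarity(row)
print(sorted_row)
```

For a full DataFrame, the rows could be passed through the helper one by one, for example via `matches.to_dict("records")`, and the resulting list of dicts turned back into a DataFrame.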

> by removing or adding new text in the to_list the similarity score changes

The to_list is used together with the from_list to build the feature matrix in the TF-IDF calculation, so it is indeed possible for the similarity scores to change when either list changes. The more words you put in either list, the better the resulting feature matrix can generalize and the more accurate your similarity function becomes.
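
To illustrate why the scores depend on the whole to_list, here is a rough pure-Python sketch of character n-gram TF-IDF with cosine similarity. It is not PolyFuzz's actual implementation: the helper names, the smoothed IDF formula, and the n-gram range of 2 to 5 are assumptions chosen to mirror the example above. The inverse document frequencies are fitted on from_list plus to_list, so dropping one string from the to_list shifts the weight of every n-gram and with it the similarity of an otherwise unchanged pair.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """All character n-grams of length n_min..n_max."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(corpus):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc)))
    n_docs = len(corpus)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict from n-gram to weight."""
    counts = Counter(char_ngrams(text))
    return {g: tf * idf[g] for g, tf in counts.items() if g in idf}


def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similarity(source, target, from_list, to_list):
    # The IDF weights are fitted on both lists together.
    idf = fit_idf(from_list + to_list)
    return cosine(tfidf_vector(source, idf), tfidf_vector(target, idf))


from_list = ["3 IN 1 LAVENDER & CAMOMILE"]
to_list_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
to_list_2 = [s for s in to_list_1 if s != "3 IN 1 LAVENDER & CHAMOMILE"]

source, target = from_list[0], "2 IN 1 LAVENDER & CAMOMILE"
sim_1 = similarity(source, target, from_list, to_list_1)
sim_2 = similarity(source, target, from_list, to_list_2)
print(sim_1, sim_2)  # same pair of strings, different fitted corpus
```

The exact numbers will not match PolyFuzz's output, but the two printed values differ from each other, which is the same effect as in the two outputs above.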