results of get_matches() are not sorted by similarity score for all the values

ashutosh486 commented 1 year ago

Hi,

I was running the PolyFuzz TF-IDF model to get matches, but in a few rows of the result the top_n matches were not sorted by similarity score.

tfidf_model = PolyFuzz(tfidf_matcher)
tfidf_model.match(from_list, to_list)
tfidf_model.get_matches()

Example:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4	To_5	Similarity_5
21	3 IN 1 LAVENDER & CAMOMILE	2 IN 1 LAVENDER & CAMOMILE	0.938	3 IN 1 LAVENDER & CAMOMILE	1	3 IN 1 LAVENDER	0.771	3 IN 1 LAVENDER & CHAMOMILE	0.831	LAVENDER CAMOMILE	0.764

MaartenGr commented 1 year ago

Could you create a minimal example out of what you show here, with values for from_list and to_list, and with the value of top_n that you selected? That would make it easier for me to figure out what exactly is going on.

ashutosh486 commented 1 year ago

Please find below a minimal example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# Two candidate lists; the second omits "3 IN 1 LAVENDER & CHAMOMILE"
test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]

test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]

test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf")
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_1)["TF-IDF"]
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_2)["TF-IDF"]

Output for test_tolist_1:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4	To_5	Similarity_5
0	3 IN 1 LAVENDER & CAMOMILE	3 IN 1 LAVENDER & CAMOMILE	1	3 IN 1 LAVENDER	0.733	LAVENDER CAMOMILE	0.81	2 IN 1 LAVENDER & CAMOMILE	0.887	3 IN 1 LAVENDER & CHAMOMILE	0.696

Output for test_tolist_2:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4
0	3 IN 1 LAVENDER & CAMOMILE	3 IN 1 LAVENDER & CAMOMILE	1	LAVENDER CAMOMILE	0.797	2 IN 1 LAVENDER & CAMOMILE	0.893	3 IN 1 LAVENDER	0.747

Problems:

The similarity scores are not sorted.
Removing or adding text in the to_list changes the similarity scores.

Just to add to this: I have commented out the following line of code, as I had asked about in the previous issue (https://github.com/MaartenGr/PolyFuzz/issues/48): https://github.com/MaartenGr/PolyFuzz/blob/b26638ff051a2d0d7c100619657b5703e47c9365/polyfuzz/models/_tfidf.py#L130

MaartenGr commented 1 year ago

The similarity scores are not sorted.

Did you install PolyFuzz through pip install polyfuzz[fast]? If so, I believe this happens because sparse_dot_topn does not return the similarities in order. I would have to check what exactly goes on there.
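Until the ordering is fixed upstream, one possible stopgap is to re-sort each result row yourself. The sketch below is not part of PolyFuzz; `sort_match_row` is a hypothetical helper that assumes the wide column layout shown in this thread (`To`, `Similarity`, `To_2`, `Similarity_2`, ...) and works on one row represented as a plain dict.

```python
def _score_key(pair):
    """Sort key for a (match, score) pair; missing scores sort last."""
    score = pair[1]
    # NaN is the only value that is not equal to itself
    if score is None or score != score:
        return -1.0
    return score


def sort_match_row(row):
    """Return a copy of one wide match row with its To/Similarity pairs
    reordered from highest to lowest similarity.

    Hypothetical helper: assumes columns named To, Similarity, To_2,
    Similarity_2, ... as in the output shown in this thread.
    """
    # "" for the first match, then "_2", "_3", ... in column order
    suffixes = [key[len("To"):] for key in row
                if key == "To" or key.startswith("To_")]
    pairs = [(row["To" + s], row["Similarity" + s]) for s in suffixes]
    pairs.sort(key=_score_key, reverse=True)

    sorted_row = dict(row)  # keep From and any other columns untouched
    for s, (match, score) in zip(suffixes, pairs):
        sorted_row["To" + s] = match
        sorted_row["Similarity" + s] = score
    return sorted_row


# The row reported above for test_tolist_1
row = {
    "From": "3 IN 1 LAVENDER & CAMOMILE",
    "To": "3 IN 1 LAVENDER & CAMOMILE", "Similarity": 1.0,
    "To_2": "3 IN 1 LAVENDER", "Similarity_2": 0.733,
    "To_3": "LAVENDER CAMOMILE", "Similarity_3": 0.81,
    "To_4": "2 IN 1 LAVENDER & CAMOMILE", "Similarity_4": 0.887,
    "To_5": "3 IN 1 LAVENDER & CHAMOMILE", "Similarity_5": 0.696,
}
result = sort_match_row(row)
```

With a pandas DataFrame `df` returned by get_matches(), the rows could be passed through this helper with `pd.DataFrame([sort_match_row(r) for r in df.to_dict("records")])`.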

Removing or adding text in the to_list changes the similarity scores.

The to_list is used together with the from_list in order to create the feature matrix as a result of the TF-IDF calculation. As such, it is indeed possible that the similarity score then changes. The more words you put in either list, the more the resulting feature matrix can generalize and the more accurate your similarity function becomes.
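To illustrate the point, here is a simplified, standard-library-only sketch of the effect. It is not PolyFuzz's implementation: it assumes character trigrams (rather than the 2 to 5 range used above), a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and cosine similarity, and the function names are made up for this example. The IDF weights are computed over from_list and to_list together, so dropping one entry from the to_list shifts the weights of the shared n-grams and with them the score of an unchanged pair.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one sparse TF-IDF vector (dict) per document.

    Assumed weighting: raw term frequency times a smoothed IDF,
    ln((1 + n_docs) / (1 + df)) + 1.
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs
    doc_freq = Counter(gram for count in counts for gram in count)
    n_docs = len(docs)
    idf = {gram: math.log((1 + n_docs) / (1 + df)) + 1
           for gram, df in doc_freq.items()}
    return [{gram: tf * idf[gram] for gram, tf in count.items()}
            for count in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(query, target, to_list):
    """Score query against target with IDF fitted on query + to_list."""
    vectors = tfidf_vectors([query] + to_list)
    return cosine(vectors[0], vectors[1 + to_list.index(target)])


to_list_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
to_list_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]
query = "3 IN 1 LAVENDER & CAMOMILE"
target = "2 IN 1 LAVENDER & CAMOMILE"

# Same pair of strings, scored against two different to_lists
score_1 = similarity(query, target, to_list_1)
score_2 = similarity(query, target, to_list_2)
exact = similarity(query, query, to_list_1)
print(score_1, score_2, exact)
```

The two scores for the same pair differ because only the surrounding corpus changed, while an exact match still scores 1 regardless of what else is in the list.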

MaartenGr / PolyFuzz

results of get_matches() are not sorted by similarity score for all the values #50