Open ashutosh486 opened 1 year ago
Could you create a minimal example out of what you show here, with values for from_list and to_list, and with the value for top_n that you selected? That makes it a bit easier for me to figure out what exactly is going on.
Please find below a minimal example:
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# The two to-lists are identical except that test_tolist_1 also contains the "CHAMOMILE" spelling.
test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]
test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf")

# Match the same from-list against each to-list and inspect the TF-IDF result table.
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_1)["TF-IDF"]
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_2)["TF-IDF"]
Output for test_tolist_1:

| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | 3 IN 1 LAVENDER | 0.733 | LAVENDER CAMOMILE | 0.81 | 2 IN 1 LAVENDER & CAMOMILE | 0.887 | 3 IN 1 LAVENDER & CHAMOMILE | 0.696 |
Output for test_tolist_2:

| | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | LAVENDER CAMOMILE | 0.797 | 2 IN 1 LAVENDER & CAMOMILE | 0.893 | 3 IN 1 LAVENDER | 0.747 |
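Problems:
1. The similarity scores are not sorted.
2. Removing or adding text in the to_list changes the similarity scores.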
Just to add to this: I have commented out the following line of code, as I had asked about in the previous issue (https://github.com/MaartenGr/PolyFuzz/issues/48): https://github.com/MaartenGr/PolyFuzz/blob/b26638ff051a2d0d7c100619657b5703e47c9365/polyfuzz/models/_tfidf.py#L130
Similarity score is not sorted
Did you install PolyFuzz through pip install polyfuzz[fast]? If so, then I believe the cause is that sparse_dot_topn does not return the similarities in order. I would have to check what exactly goes on there.
By removing or adding new text in the to_list, the similarity score changes
The to_list is used together with the from_list to create the feature matrix that results from the TF-IDF calculation. As such, it is indeed possible for the similarity score to change when either list changes. The more words you put in either list, the better the resulting feature matrix can generalize and the more accurate your similarity function becomes.
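To illustrate why the score for the same pair of strings can move when the to_list changes, here is a small self-contained sketch of TF-IDF over character n-grams. It is not PolyFuzz's implementation: the trigram size, the smoothed IDF formula, and all function names are assumptions chosen for illustration. The point is only that the IDF weights are fitted on both lists together, so adding one string to the to_list changes the weights of every n-gram it does or does not contain.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Smoothed inverse document frequency per n-gram: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc, n)))
    return {gram: math.log((1 + n_docs) / (1 + df)) + 1.0 for gram, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Sparse TF-IDF vector as a dict; n-grams unseen during fitting are ignored."""
    counts = Counter(char_ngrams(text, n))
    return {gram: tf * idf[gram] for gram, tf in counts.items() if gram in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similarity(query, candidate, to_list, from_list):
    """Fit the IDF weights on both lists together, then compare two strings."""
    idf = fit_idf(to_list + from_list)
    return cosine(tfidf_vector(query, idf), tfidf_vector(candidate, idf))


to_list_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
to_list_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]
from_list = ["3 IN 1 LAVENDER & CAMOMILE"]

# Same pair of strings, scored against two different corpora.
sim_1 = similarity(from_list[0], "3 IN 1 LAVENDER", to_list_1, from_list)
sim_2 = similarity(from_list[0], "3 IN 1 LAVENDER", to_list_2, from_list)
print(sim_1, sim_2)
```

The two printed values differ even though the compared strings are identical in both calls, because n-grams such as "CAM" appear in a different share of the documents once the "CHAMOMILE" entry is added.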
Hi,
I was running the PolyFuzz TF-IDF model to get matches, but in a few rows of the result the top_n matches were not sorted by similarity score.