MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
725 stars, 68 forks

TF-IDF is giving same score for different to_list #48

Closed by ashutosh486 1 year ago

ashutosh486 commented 1 year ago

Hi, I am observing that TF-IDF is giving an exact-match score for terms that are not exact matches.

For example:

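from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
test_fromlist = ["i testtext"]

# Character n-grams of length 2 to 5; return the top 5 matches regardless of score
test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf_test")

PolyFuzz(test_model).match(test_fromlist, test_tolist).get_matches()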
Output:

   From        To            Similarity  To_2        Similarity_2  To_3        Similarity_3  To_4      Similarity_4  To_5        Similarity_5
0  i testtext  i q testtext  1           j testtext  1             x testtext  1             testtext  1             k testtext  1

Explanation: Here "i testtext" is scored as an exact match (similarity 1) against "x testtext" and the other entries, even though the strings differ. I also ran the same test with RapidFuzz using fuzz.ratio as the scorer, and it gives the expected result. I am assuming the scorer in TF-IDF is set to partial_token_ratio, since RapidFuzz gives the same result with that scorer.

MaartenGr commented 1 year ago

That is correct. This implementation of the TF-IDF similarity measure removes n-grams that contain whitespace in order to prevent RAM issues when analyzing large datasets:

https://github.com/MaartenGr/PolyFuzz/blob/b26638ff051a2d0d7c100619657b5703e47c9365/polyfuzz/models/_tfidf.py#L130
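
To see why every candidate in the example scores 1, here is a minimal standalone sketch of character n-gram extraction with the whitespace filter described above. It is not PolyFuzz's code: the function name `char_ngrams` and the `keep_whitespace` flag are hypothetical and exist only for illustration. With an n-gram range of (2, 5), the single-letter prefixes ("i", "x", "i q", ...) can only appear in n-grams that also contain a space, so once those are dropped each string is left with the n-grams of "testtext" alone.

```python
from typing import List, Tuple


def char_ngrams(
    text: str,
    n_gram_range: Tuple[int, int] = (2, 5),
    keep_whitespace: bool = False,  # hypothetical flag, for illustration only
) -> List[str]:
    """Return character n-grams for every n in the inclusive range."""
    low, high = n_gram_range
    result = []
    for n in range(low, high + 1):
        for start in range(len(text) - n + 1):
            gram = text[start:start + n]
            # Drop any n-gram containing a space unless asked to keep them
            if keep_whitespace or " " not in gram:
                result.append(gram)
    return result


from_string = "i testtext"
to_list = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]

# With the filter on, every candidate yields the same n-gram list as the query
reference = char_ngrams(from_string)
same_as_reference = [char_ngrams(s) == reference for s in to_list]
print(same_as_reference)

# With the filter off, the prefixes survive and the lists differ
differs_with_whitespace = (
    char_ngrams("i testtext", keep_whitespace=True)
    != char_ngrams("x testtext", keep_whitespace=True)
)
print(differs_with_whitespace)
```

Identical n-gram lists produce identical TF-IDF vectors, so the cosine similarity between them is 1 regardless of the weighting.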

If you do want whitespace n-grams kept for your dataset, you can remove that line yourself and create a TF-IDF vectorizer with your own custom settings, as described in the documentation.
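
As a rough illustration of what keeping whitespace n-grams changes, the sketch below computes TF-IDF cosine similarity in plain Python with no filtering. It is an assumption-laden stand-in, not PolyFuzz's vectorizer: the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`) are hypothetical, the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is chosen for illustration, and the weights are fitted on the query and candidates together.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n_gram_range: Tuple[int, int] = (2, 5)) -> List[str]:
    """Character n-grams for every n in the inclusive range, spaces included."""
    low, high = n_gram_range
    return [
        text[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """L2-normalised TF-IDF vectors stored as sparse dicts."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many strings each n-gram occurs
    df = Counter(gram for c in counts for gram in c)
    idf = {g: math.log((1 + n_docs) / (1 + f)) + 1.0 for g, f in df.items()}
    vectors = []
    for c in counts:
        vec = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


from_list = ["i testtext"]
to_list = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]

vectors = tfidf_vectors(from_list + to_list)
query, candidates = vectors[0], vectors[1:]
scores = {s: cosine(query, v) for s, v in zip(to_list, candidates)}

for candidate, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{candidate!r}: {score:.3f}")
```

Because n-grams such as "i " and "x " are now part of the vectors, none of the candidates is scored as identical to the query. The trade-off mentioned above still applies: keeping these n-grams enlarges the vocabulary, which costs more memory on large datasets.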