TF-IDF is giving same score for different to_list

MaartenGr / PolyFuzz

Hi, I am observing that TF-IDF returns a similarity of 1 (an exact match) for terms that are not exact matches.

For example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]
test_fromlist = ["i testtext"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf_test")

PolyFuzz(test_model).match(test_fromlist, test_tolist).get_matches()

Output:

   From        To            Similarity  To_2        Similarity_2  To_3        Similarity_3  To_4      Similarity_4  To_5        Similarity_5
0  i testtext  i q testtext  1           j testtext  1             x testtext  1             testtext  1             k testtext  1

Explanation: Here "i testtext" gets a similarity of 1 against "x testtext" and every other entry, even though the strings differ. I ran the same test with RapidFuzz using fuzz.ratio as the scorer, and it gives the expected result. I assume the scorer in TF-IDF behaves like partial_token_ratio, since RapidFuzz with that scorer also gives the same result as above.
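
One possible cause, offered only as a guess that nothing in this thread confirms: if the TF-IDF model builds character n-grams of length 2 or more and drops any n-gram that contains a space, then single-character tokens such as "i", "x", or "q" contribute no features at all. The sketch below uses a hypothetical helper, char_ngrams, written for illustration (it is not PolyFuzz's API), to show that under this assumption every string in the example reduces to the same n-grams as "testtext".

```python
from typing import List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> List[str]:
    """Hypothetical tokenizer (an assumption, not PolyFuzz code):
    character n-grams of length n_min..n_max, skipping any that contain a space."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if " " not in gram:
                grams.append(gram)
    return grams


from_item = "i testtext"
to_list = ["k testtext", "testtext", "x testtext", "j testtext", "i q testtext"]

# Compare each candidate's n-grams with those of the query string.
reference = sorted(char_ngrams(from_item))
identical = {
    candidate: sorted(char_ngrams(candidate)) == reference
    for candidate in to_list
}
print(identical)  # every value is True under this assumption
```

If the features are identical, any vector-based similarity computed on them would be 1 for all five candidates, which would match the output above. A character-level scorer such as fuzz.ratio still sees the differing first character, which would explain why it separates the strings.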

TF-IDF is giving same score for different to_list #48