Closed ashutosh486 closed 1 year ago
That is correct. This implementation of the TF-IDF similarity measure removes n-grams that contain whitespace in order to prevent RAM issues when analyzing large datasets:
If you do want n-grams with whitespace kept for your dataset, you can remove that line yourself and create a TF-IDF vectorizer with your own custom settings, as described in the documentation.
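To illustrate why dropping whitespace n-grams produces a perfect score here, below is a minimal pure-Python sketch. It is not the library's code: `char_ngrams` and `cosine` are hypothetical helpers, the n-gram size of 3 is an assumption, and raw n-gram counts stand in for real TF-IDF weights.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3, drop_whitespace=True):
    """Return character n-grams, optionally dropping any that contain a space."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if drop_whitespace:
        grams = [g for g in grams if " " not in g]
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram lists, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    sa = sum(v * v for v in ca.values())
    sb = sum(v * v for v in cb.values())
    return dot / sqrt(sa * sb) if sa and sb else 0.0


left, right = "i testtext", "x testtext"

# With whitespace n-grams dropped, "i t", "x t" and " te" disappear,
# so both strings reduce to the n-grams of "testtext".
dropped = cosine(char_ngrams(left), char_ngrams(right))

# With whitespace n-grams kept, "i t" and "x t" differ, so the score falls below 1.
kept = cosine(
    char_ngrams(left, drop_whitespace=False),
    char_ngrams(right, drop_whitespace=False),
)

print(dropped, kept)
```

Under these assumptions the two strings are indistinguishable once the whitespace n-grams are removed, because the only characters that differ sit next to the space.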
Hi, I am observing that TF-IDF gives an exact match for terms that are not exact matches.
For example:
Explanation: Here "i testtext" is matched exactly to "x testtext" and others, even though the strings differ. I also tested the same data with RapidFuzz using fuzz.ratio as the scorer, and it gives the required result. I am assuming the scorer in TF-IDF is set to partial_token_ratio, since RapidFuzz also gives the same result with it.