MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License
733 stars, 67 forks

Is the TFIDF implementation correct? #27

Closed: Stefannn closed this issue 2 years ago

Stefannn commented 3 years ago

Hi! First of all thank you for the neat library and great documentation!

I have a question about the TF-IDF implementation: why is the TfidfVectorizer fitted on both the to_list and the from_list (line 99 in models/_tfidf.py)? Intuitively, I would have called fit() with the to_list only; otherwise the similarity scores depend on the from_list, which is usually not desirable.

Thanks!
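
To make the concern concrete, here is a minimal sketch of how the IDF weights can shift when the from_list is included in the fit. It uses scikit-learn's TfidfVectorizer with character 3-grams as a stand-in; the example strings and the analyzer settings are assumptions for illustration and are not taken from PolyFuzz's own code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example data, not from the PolyFuzz repository.
from_list = ["apple", "apples", "mouse"]
to_list = ["apple", "house"]


def make_vectorizer():
    # Character 3-grams are an assumed setting for this sketch.
    return TfidfVectorizer(analyzer="char", ngram_range=(3, 3))


# Fit once on the to_list only, and once on both lists together.
fit_to_only = make_vectorizer().fit(to_list)
fit_both = make_vectorizer().fit(from_list + to_list)

# Look up the IDF weight of the same n-gram under both fits.
ngram = "ous"
idf_to_only = fit_to_only.idf_[fit_to_only.vocabulary_[ngram]]
idf_both = fit_both.idf_[fit_both.vocabulary_[ngram]]

# The weight changes because "mouse" in the from_list also contains "ous".
print(ngram, idf_to_only, idf_both)
```

Since the weights feed into the cosine similarity, a different from_list can produce different scores for the same pair of strings.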

MaartenGr commented 3 years ago

Yes. If the vectorizer were fitted on the to_list and the from_list independently, the two resulting matrices would have different shapes. It is important that TF-IDF is trained on both lists so that they share the same vocabulary. If you fitted it only on the to_list, there is a good chance that information about the from_list would be missing, which would in turn decrease performance.
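
The shape argument can be sketched with scikit-learn directly. This is an illustration under assumed settings (character 3-grams, made-up strings), not the code from models/_tfidf.py: two independently fitted vectorizers produce matrices with different column counts, while one vectorizer fitted on both lists gives matrices that can be compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example data, not from the PolyFuzz repository.
from_list = ["apple", "apples", "mouse"]
to_list = ["apple", "house"]


def make_vectorizer():
    # Character 3-grams are an assumed setting for this sketch.
    return TfidfVectorizer(analyzer="char", ngram_range=(3, 3))


# Fitted independently: each list gets its own vocabulary,
# so the column counts differ and the matrices cannot be compared.
from_alone = make_vectorizer().fit_transform(from_list)
to_alone = make_vectorizer().fit_transform(to_list)
print(from_alone.shape, to_alone.shape)

# Fitted on both lists: one shared vocabulary, same number of columns.
shared = make_vectorizer().fit(from_list + to_list)
from_vecs = shared.transform(from_list)
to_vecs = shared.transform(to_list)

# Rows are from_list entries, columns are to_list entries.
similarity = cosine_similarity(from_vecs, to_vecs)
print(similarity.shape)
```

In the shared fit, n-grams that occur only in the from_list (such as those from "apples") still get a column, which is the information that would be dropped if the vectorizer only saw the to_list.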