dbrgn opened 8 years ago
Currently, there isn't a streamlined way to get the TF-IDF values for every word that occurs in a document.
As I understand it, TF-IDF is more useful for checking whether a word is important within a corpus of texts. I'm not sure how often it is used to compare the similarity between documents (I have pretty limited experience in this space, so it is probably best not to take my word for it). I think computing IDF over only 2 documents makes sense for educational purposes, but you would typically want more documents to make your scores more accurate.
I think you're right. I ended up simply implementing term frequency myself. It's not generic (the way it would have been using your library), and it would have been nicer if there were a vector math library à la numpy, but it works and has no external dependencies :)
Working with vectors could still be interesting though.
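For anyone landing here later, a minimal sketch of the dependency-free approach described above (this is an assumed shape, not dbrgn's actual code): count terms, normalize to relative frequencies, and compare documents with cosine similarity over the union of their vocabularies.

```rust
use std::collections::HashMap;

// Count raw term occurrences in a tokenized document.
fn term_counts(tokens: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for t in tokens {
        *counts.entry(t.to_string()).or_insert(0) += 1;
    }
    counts
}

// Turn raw counts into relative term frequencies (tf = count / doc length).
fn term_frequencies(counts: &HashMap<String, usize>) -> HashMap<String, f64> {
    let total: usize = counts.values().sum();
    counts
        .iter()
        .map(|(w, &c)| (w.clone(), c as f64 / total as f64))
        .collect()
}

// Cosine similarity of two sparse vectors; words missing from a map count as 0.
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(w, &x)| b.get(w).map(|&y| x * y))
        .sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|v| v * v).sum::<f64>().sqrt();
    let (na, nb) = (norm(a), norm(b));
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```

Since the vectors are kept as sparse maps, the dot product only needs to touch words present in both documents, which is why no numpy-style dense vector library is strictly necessary.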
I'm still trying to wrap my head around TF-IDF, therefore this might be a stupid question :)
I want to compare the similarity between two documents. I already have code in place to extract the words from the documents and to count them. The result is a `HashMap<String, usize>`. What I want to get now is a vector that contains TF-IDF values for every word that occurs in the documents, so that I can determine the cosine similarity between them.
Is this possible with the current API? If I understand it correctly, the `tfidf` function simply calculates the TF-IDF value for a single word, right? Does IDF even make much sense if there are only 2 documents?
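To make the question concrete, here is a hand-rolled sketch (not this library's API) of building one TF-IDF vector per document from the `HashMap<String, usize>` counts described above, using the plain `idf = ln(N / df)` variant:

```rust
use std::collections::HashMap;

// Build a TF-IDF vector for each document.
// `docs` holds per-document word counts (HashMap<String, usize>).
// tf = count / doc length; idf = ln(N / df), the unsmoothed variant.
fn tfidf_vectors(docs: &[HashMap<String, usize>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;
    // Document frequency: in how many documents each word appears.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in docs {
        for word in doc.keys() {
            *df.entry(word.as_str()).or_insert(0) += 1;
        }
    }
    docs.iter()
        .map(|doc| {
            let total: usize = doc.values().sum();
            doc.iter()
                .map(|(w, &c)| {
                    let tf = c as f64 / total as f64;
                    let idf = (n / df[w.as_str()] as f64).ln();
                    (w.clone(), tf * idf)
                })
                .collect()
        })
        .collect()
}
```

This also illustrates why IDF is dubious with only 2 documents: any word appearing in both gets `idf = ln(2/2) = 0`, so the two vectors share no non-zero dimensions and their cosine similarity is always 0 under this variant. Smoothed idf formulas (which add constants to the numerator and denominator) avoid that hard zero.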