Closed · nkonts closed this issue 6 years ago
Hi @Nickkontscha. Thanks a lot for the detailed report. I think you are right: the current smooth tf-idf formula has this flaw (I would consider it a bug, as I no longer remember the reasoning behind why it was done this way). I've changed it to be in line with the Wikipedia definition.
Hi, I am not sure whether it is a typo, a wrong implementation, or my own understanding that is at fault.

The documentation of `?TfIdf` describes the smooth IDF with a "+1" in the denominator:

`idf_smooth = log(N / (df + 1))`

(where `N` is the number of documents in the corpus and `df` is the number of documents containing the term). The Wikipedia article on tf-idf defines the smooth IDF as:

`idf_smooth = log(N / (1 + df)) + 1`
A quick example: a corpus with 3 documents. The unsmoothed IDF would have the possible values

- `df = 1`: `log(3/1) ≈ 1.099`
- `df = 2`: `log(3/2) ≈ 0.405`
- `df = 3`: `log(3/3) = 0`

Smoothed according to the documentation:

- `df = 1`: `log(3/2) ≈ 0.405`
- `df = 2`: `log(3/3) = 0`
- `df = 3`: `log(3/4) ≈ -0.288`

Smoothed according to Wikipedia:

- `df = 1`: `log(3/2) + 1 ≈ 1.405`
- `df = 2`: `log(3/3) + 1 = 1`
- `df = 3`: `log(3/4) + 1 ≈ 0.712`
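The values above can be reproduced with a short script (a minimal sketch in plain Python rather than the original R; `math.log` is the natural logarithm, matching the definitions above):

```python
import math

N = 3  # documents in the corpus

def idf(df):
    """Plain (unsmoothed) IDF: log(N / df)."""
    return math.log(N / df)

def idf_doc(df):
    """Smoothing as described in the ?TfIdf documentation: +1 in the denominator."""
    return math.log(N / (df + 1))

def idf_wiki(df):
    """Smoothing as defined on Wikipedia: +1 added outside the logarithm."""
    return math.log(N / (1 + df)) + 1

for df in (1, 2, 3):
    print(df, round(idf(df), 3), round(idf_doc(df), 3), round(idf_wiki(df), 3))
```

Note that the documentation's variant hits exactly `0` at `df = N - 1` and even goes negative at `df = N`, while the Wikipedia variant stays strictly positive.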
I am not sure why the documentation says that division by zero might happen: the tf-idf is computed by multiplication. And if I wanted to divide by the IDF somewhere, I could still divide by zero with the current smooth implementation, since it can return `0`.
The issue I personally have with this representation of the tf-idf is that a word which appears in `(# documents in the corpus) - 1` documents will have an IDF of `0`, and therefore a tf-idf value of `0` as well (since `Tf * 0 = 0`). That is the same value as for a word which does not appear in a document at all. A small example which illustrates it:
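(The original code snippet appears to have been lost in this copy of the issue; the following is a reconstruction of the idea in plain Python, with a hypothetical 3-document corpus in which `word2` appears in `N - 1 = 2` documents.)

```python
import math

# Hypothetical corpus of N = 3 tokenized documents.
# "word2" appears in 2 of 3 documents (i.e. N - 1).
docs = [
    ["word1", "word2", "word2", "word3"],
    ["word2", "word3"],
    ["word3"],
]
N = len(docs)
vocab = ["word1", "word2", "word3"]

# Document frequency of each term.
df = {w: sum(w in d for d in docs) for w in vocab}

# Smooth IDF as described in the ?TfIdf documentation: log(N / (df + 1)).
idf = {w: math.log(N / (df[w] + 1)) for w in vocab}

# tf-idf matrix, using raw counts as the term frequency.
tfidf = [[d.count(w) * idf[w] for w in vocab] for d in docs]

for row in tfidf:
    print([round(x, 3) for x in row])
# The "word2" column is all zeros: idf["word2"] = log(3 / 3) = 0,
# so the fact that word2 occurs twice in the first document is lost.
```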
`word2` appears in `(# documents in the corpus) - 1` documents. Plotting the matrix shows that every tf-idf entry for `word2` equals `0`, as pointed out by the log computation above. In addition, the term-frequency information of `word2` in the matrix is lost, as `Tf("word2") * Idf("word2") = 0`. A user (or a classifier, model, ...) can now not distinguish whether `word2` is irrelevant with a term frequency of `0` or whether it appeared in `(# documents in the corpus) - 1` documents (with a possibly relevant term frequency).
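The ambiguity, and how the Wikipedia-style smoothing avoids it, can be shown in two lines (a sketch; the term frequency `tf = 5` is an arbitrary illustrative value):

```python
import math

N = 3   # documents in the corpus
tf = 5  # some non-zero term frequency for word2

# word2 appears in N - 1 = 2 documents.
idf_doc  = math.log(N / (2 + 1))      # documentation smoothing -> 0.0
idf_wiki = math.log(N / (1 + 2)) + 1  # Wikipedia smoothing     -> 1.0

print(tf * idf_doc)   # 0.0 -- indistinguishable from a word that never occurs
print(tf * idf_wiki)  # 5.0 -- the term frequency survives
```

With the documentation's formula the product collapses to `0` regardless of `tf`; with the Wikipedia formula the term-frequency signal is preserved.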