dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

TfIdf smooth_idf #280

Closed nkonts closed 6 years ago

nkonts commented 6 years ago

Hi, I am not sure whether this is a typo, a wrong implementation, or my own misunderstanding:

The documentation of ?TfIdf says:

smooth_idf: TRUE - smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. This prevents division by zero.

which results in the "+1" in the definition of the IDF:

The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))

The Wikipedia article on tf-idf says that the smooth IDF is defined as:

idf = log( 1 + (# documents in the corpus) / (# documents where the term appears) )

A quick example would be a corpus with 3 documents: the unsmoothed IDF would have the possible values

log(3/(1:3))
[1] 1.0986123 0.4054651 0.0000000

Smoothed according to the documentation:

log(3/(1:3 + 1))
[1] 0.4054651 0.0000000 -0.2876821

Smoothed according to Wikipedia:

log(1 + 3/(1:3))
[1] 1.3862944 0.9162907 0.6931472
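
To make the comparison easier to see, here is a small helper (not part of text2vec, just for illustration) that tabulates all three variants for a corpus of n documents:

# Illustrative helper, not part of text2vec: tabulate the three idf
# variants for every possible document frequency in a corpus of size n.
compare_idf <- function(n) {
  df <- seq_len(n)  # document frequencies 1..n
  data.frame(df       = df,
             raw      = log(n / df),        # unsmoothed
             doc_text = log(n / (df + 1)),  # smoothing per the current docs
             wiki     = log(1 + n / df))    # smoothing per Wikipedia
}
compare_idf(3)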

I am not sure why the documentation says that division by zero might happen, since the TfIdf is calculated by multiplication. And if I wanted to divide by the IDF somewhere, I could still divide by zero with the current smooth implementation.
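As far as I can tell, the only case where the unsmoothed formula actually divides by zero is a document frequency of 0, which cannot occur for a term that made it into the vocabulary:

log(3 / 0)        # Inf: df = 0, the case the "+1" apparently guards against
log(3 / (0 + 1))  # log(3) under the current smoothing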

The issue I personally have with this implementation of the TfIdf is that a word which appears in (# documents in the corpus) - 1 documents will have an IDF of 0 and therefore a TfIdf value of 0, too (as Tf*0 = 0), which is the same value as for a word that does not appear in a document at all.

A small example which illustrates it:

library(text2vec)

text <- c("Word1 Word2 Word3",
          "Word1 Word2",
          "Word1")
data <- data.frame(text = text, id = c(1, 2, 3), stringsAsFactors = FALSE)

# tokenize and build the document-term matrix
data_tokens <- itoken(data$text,
                      preprocessor = tolower, tokenizer = word_tokenizer,
                      ids = data$id, progressbar = FALSE)
data_vocab <- create_vocabulary(data_tokens, stopwords = stopwords::stopwords("en"))
vectorizer <- vocab_vectorizer(data_vocab)
data_dtm <- create_dtm(data_tokens, vectorizer)

# fit the tf-idf model and transform the dtm
tfidf_model <- TfIdf$new()
data_tfidf <- fit_transform(data_dtm, tfidf_model)

as.matrix(data_tfidf)

as.matrix(data_tfidf)[2, ]
# => according to the TfIdf matrix, word2 is not in text[2] (which is wrong!)
#    and is as unimportant as word3, which really does not appear in text[2]

word2 appears in (# documents in the corpus) - 1 documents. Printing the matrix shows that every TfIdf entry for word2 equals 0, as pointed out by the log computation above. In addition, the term frequency information of word2 in the matrix gets lost, as Tf("word2") * Idf("word2") = 0. A user (or classifier, model, ...) can now not distinguish whether word2 is irrelevant with a term frequency of 0 or whether it appeared in (# documents in the corpus) - 1 documents (with a possibly relevant term frequency).
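
The zero can be checked directly from the smoothed formula (here N = 3 documents, and word2 has a document frequency of 2):

N  <- 3  # documents in the corpus
df <- 2  # documents containing word2
log(N / (df + 1))  # current smoothing: 0, so every TfIdf entry for word2 is 0
log(1 + N / df)    # Wikipedia smoothing: ~0.916, term frequency is preserved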

dselivanov commented 6 years ago

Hi @Nickkontscha. Thanks a lot for the detailed report. I think you are right and the current smooth tf-idf formula has this flaw (I would consider it a bug, as I no longer remember the reasoning behind doing it this way). I've changed it to be in line with the Wikipedia definition.
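
For reference, a minimal sketch of the corrected smoothing (hypothetical code, not the actual text2vec internals; it assumes a simple l1 term-frequency normalisation):

# Hypothetical sketch, not the actual text2vec implementation:
# smoothed idf per the Wikipedia definition, log(1 + N / df)
sketch_tfidf <- function(dtm) {
  m   <- as.matrix(dtm)     # dense copy, fine for a toy corpus
  N   <- nrow(m)
  df  <- colSums(m > 0)     # document frequency of each term
  idf <- log(1 + N / df)    # Wikipedia-style smoothing
  tf  <- m / rowSums(m)     # l1-normalised term frequencies
  sweep(tf, 2, idf, `*`)    # scale each term column by its idf
}
round(sketch_tfidf(data_dtm), 3)  # word2 now keeps a nonzero weight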