dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

TfIdf smooth_idf #280

Closed nkonts closed 6 years ago

nkonts commented 6 years ago

Hi, I am not sure whether this is a typo, a wrong implementation, or my own misunderstanding:

The documentation of ?TfIdf says:

smooth_idf: TRUE - smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. This prevents division by zero.

which results in the "+1" in the definition of the IDF:

The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))

The Wikipedia article on tf-idf says that the smooth IDF is defined as:

idf = log( 1 + (# documents in the corpus) / (# documents where the term appears) )

A quick example would be a corpus with 3 documents: the unsmoothed IDF would have the possible values

log(3/(1:3))
[1] 1.0986123 0.4054651 0.0000000

Smoothed according to the documentation:

log(3/(1:3 + 1))
[1] 0.4054651 0.0000000 -0.2876821

Smoothed according to Wikipedia:

log(1 + 3/(1:3))
[1] 1.3862944 0.9162907 0.6931472
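
To make the comparison easier to see, here is a small helper (not part of text2vec, just for illustration) that tabulates all three variants for a corpus of n documents:

# Illustrative helper, not part of text2vec: tabulate the three idf
# variants for every possible document frequency in a corpus of size n.
compare_idf <- function(n) {
  df <- seq_len(n)  # document frequencies 1..n
  data.frame(df       = df,
             raw      = log(n / df),        # unsmoothed
             doc_text = log(n / (df + 1)),  # smoothing per the current docs
             wiki     = log(1 + n / df))    # smoothing per Wikipedia
}
compare_idf(3)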

I am not sure why the documentation says that division by zero might happen, since the TfIdf is calculated by multiplication. And if I wanted to divide by the IDF somewhere, I could still divide by zero with the current smooth implementation.
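As far as I can tell, the only case where the unsmoothed formula actually divides by zero is a document frequency of 0, which cannot occur for a term that made it into the vocabulary:

log(3 / 0)        # Inf: df = 0, the case the "+1" apparently guards against
log(3 / (0 + 1))  # log(3) under the current smoothing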

The issue I personally have with this implementation of the TfIdf is that a word which appears in (# documents in the corpus) - 1 documents will have an IDF of 0 and therefore a TfIdf value of 0, too (as Tf*0 = 0), which is the same value as for a word that does not appear in a document at all.

A small example which illustrates it:

library(text2vec)

text <- c("Word1 Word2 Word3",
          "Word1 Word2",
          "Word1")
data <- data.frame(text = text, id = c(1, 2, 3), stringsAsFactors = FALSE)

# tokenize and build the document-term matrix
data_tokens <- itoken(data$text,
                      preprocessor = tolower, tokenizer = word_tokenizer,
                      ids = data$id, progressbar = FALSE)
data_vocab <- create_vocabulary(data_tokens, stopwords = stopwords::stopwords("en"))
vectorizer <- vocab_vectorizer(data_vocab)
data_dtm <- create_dtm(data_tokens, vectorizer)

# fit the tf-idf model and transform the dtm
tfidf_model <- TfIdf$new()
data_tfidf <- fit_transform(data_dtm, tfidf_model)

as.matrix(data_tfidf)

as.matrix(data_tfidf)[2, ]
# => according to the TfIdf matrix, word2 is not in text[2] (which is wrong!)
#    and is as unimportant as word3, which really does not appear in text[2]

word2 appears in (# documents in the corpus) - 1 documents. Printing the matrix shows that every TfIdf entry for word2 equals 0, as pointed out by the log computation above. In addition, the term frequency information of word2 in the matrix gets lost, as Tf("word2") * Idf("word2") = 0. A user (or classifier, model, ...) can now not distinguish whether word2 is irrelevant with a term frequency of 0 or whether it appeared in (# documents in the corpus) - 1 documents (with a possibly relevant term frequency).
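
The zero can be checked directly from the smoothed formula (here N = 3 documents, and word2 has a document frequency of 2):

N  <- 3  # documents in the corpus
df <- 2  # documents containing word2
log(N / (df + 1))  # current smoothing: 0, so every TfIdf entry for word2 is 0
log(1 + N / df)    # Wikipedia smoothing: ~0.916, term frequency is preserved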

dselivanov commented 6 years ago

Hi @Nickkontscha. Thanks a lot for the detailed report. I think you are right and the current smooth tf-idf formula has this flaw (I would consider it a bug, as I no longer remember the reasoning behind doing it this way). I've changed it to be in line with the Wikipedia definition.
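
For reference, a minimal sketch of the corrected smoothing (hypothetical code, not the actual text2vec internals; it assumes a simple l1 term-frequency normalisation):

# Hypothetical sketch, not the actual text2vec implementation:
# smoothed idf per the Wikipedia definition, log(1 + N / df)
sketch_tfidf <- function(dtm) {
  m   <- as.matrix(dtm)     # dense copy, fine for a toy corpus
  N   <- nrow(m)
  df  <- colSums(m > 0)     # document frequency of each term
  idf <- log(1 + N / df)    # Wikipedia-style smoothing
  tf  <- m / rowSums(m)     # l1-normalised term frequencies
  sweep(tf, 2, idf, `*`)    # scale each term column by its idf
}
round(sketch_tfidf(data_dtm), 3)  # word2 now keeps a nonzero weight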