When I use the built-in function for TF-IDF weighting, I get values that differ from my own calculation by about a factor of 10. Does anybody have any idea why?
## Make a Document Term Matrix
### RowColumnMatrix
tweet_matrix_tdm <- as.matrix(TermDocumentMatrix(tweet_corpus_clean))
tweet_matrix_dtm <- as.matrix(DocumentTermMatrix(tweet_corpus_clean))
colnames(tweet_matrix_dtm)[1:10]
## Use Term Frequency and Inverse Document Frequency
N <- nrow(tweet_matrix_dtm) # Number of Documents
ft <- colSums(tweet_matrix_dtm > 0) # Number of documents in which term t appears
TF <- log(tweet_matrix_dtm + 1)
IDF <- log(N/ft)
# Each column of TF (one term) has to be scaled by the matching
# entry of IDF. There are two ways to do that:
#   1. a for loop over the columns, which will be really slow
#   2. put IDF on the diagonal of a matrix and use matrix multiplication
tweet_weighted <- TF %*% diag(IDF)
colnames(tweet_weighted) <- colnames(tweet_matrix_dtm)
### RowColumnMatrix
tweet_weighted[1:12, 1:7]
## weightTfIdf expects a TermDocumentMatrix, read the help
## ?weightTfIdf
tweet_weighted_two <- as.matrix(weightTfIdf(TermDocumentMatrix(tweet_corpus_clean)))
# Transpose so that documents are rows, matching tweet_weighted above
t(tweet_weighted_two[1:7, 1:12])
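As a sanity check on the manual calculation, the diagonal-matrix step can be compared against `sweep()`, which scales each column directly. This is only a small sketch on invented numbers: `tf_toy` and `idf_toy` are hypothetical stand-ins for `TF` and `IDF`, not my real data.

```r
# Hypothetical stand-ins for TF and IDF (values invented for illustration)
tf_toy  <- matrix(c(1, 0, 2,
                    0, 3, 1), nrow = 2, byrow = TRUE)
idf_toy <- c(0.5, 1.0, 2.0)

via_diag  <- tf_toy %*% diag(idf_toy)        # same approach as the code above
via_sweep <- sweep(tf_toy, 2, idf_toy, "*")  # multiply column j by idf_toy[j]

# Both scale column j by idf_toy[j], so the two results should agree
all.equal(via_diag, via_sweep)
```

If the two agree, the multiplication step itself is probably not where the discrepancy comes from. `sweep()` also avoids building a square matrix with one row and column per term, which may matter for a large vocabulary.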
I did a little more investigating. I think the built-in function uses log base 2 rather than base e, but that still doesn't explain the tenfold discrepancy.
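To put a number on the base-2 point, here is a minimal sketch on a made-up matrix (it uses neither tm nor my tweet data). Changing the base only rescales the IDF by 1 / log(2), roughly 1.44. The second half rests on an assumption I have not verified against the tm source: that the `normalize` argument of `weightTfIdf` divides raw counts by document length instead of taking `log(count + 1)`. If that holds, documents of about ten terms would shift the values by about an order of magnitude.

```r
# Toy document-term matrix: 3 documents (rows) by 4 terms (columns).
# All counts are invented; every document has exactly 10 terms.
dtm <- matrix(
  c( 2, 1, 0, 7,
     0, 3, 1, 6,
    10, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("apple", "pear", "plum", "fig"))
)

N  <- nrow(dtm)
df <- colSums(dtm > 0)   # number of documents containing each term

idf_e <- log(N / df)     # natural log, as in my manual calculation
idf_2 <- log2(N / df)    # base 2, as I suspect the built-in uses

# Changing the base is a constant rescaling: log2(x) == log(x) / log(2)
idf_2 / idf_e

# Manual weighting from the code above: log-scaled term frequency
manual <- sweep(log(dtm + 1), 2, idf_e, "*")

# ASSUMPTION: the built-in divides raw counts by document length
tf_norm    <- dtm / rowSums(dtm)
normalised <- sweep(tf_norm, 2, idf_2, "*")

# Compare the scale of the two weightings on the non-zero entries
manual[dtm > 0] / normalised[dtm > 0]
```

One way to test the assumption on the real data would be to rerun `weightTfIdf` with `normalize = FALSE` and see whether the gap shrinks to something closer to the base-2 factor.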