RyanGreenup / SWA-Project

Github for Social Web Analytics

TF-IDF Weighting different using built-in function #7

Open RyanGreenup opened 4 years ago

RyanGreenup commented 4 years ago

When I use the built-in function for TF-IDF weighting, I get values that differ from my own by about a factor of 10. Does anybody have any idea why?


## Make a Document Term Matrix
                         ### RowColumnMatrix
tweet_matrix_tdm   <- as.matrix(TermDocumentMatrix(tweet_corpus_clean))
tweet_matrix_dtm   <- as.matrix(DocumentTermMatrix(tweet_corpus_clean))
colnames(tweet_matrix_dtm)[1:10]

## Use Term Frequency and Inverse Document Frequency
N <- nrow(tweet_matrix_dtm)   # Number of Documents
ft <- colSums(tweet_matrix_dtm > 0)   # Number of documents in which term t appears
TF <- log(tweet_matrix_dtm + 1)
IDF <- log(N/ft)

    # Each column of TF needs to be scaled by the matching entry of IDF.
    # There are two ways to do it:
      # a for loop, which will be really slow
      # put IDF on the diagonal of a matrix, then use matrix multiplication

tweet_weighted           <- TF %*% diag(IDF)
colnames(tweet_weighted) <- colnames(tweet_matrix_dtm)

                         ### RowColumnMatrix
tweet_weighted[1:12, 1:7]

## This expects a TermDocumentMatrix, read the help
## ?weightTfIdf
tweet_weighted_two <- as.matrix(weightTfIdf(TermDocumentMatrix(tweet_corpus_clean)))
t(tweet_weighted_two[1:7, 1:12])   # same as `%>% t`, without needing magrittr

Output:

    Terms
Docs      back     black   compani     don’t     flag’      good     howev
  1  0.2982892 0.3821928 0.3482892 0.3190411 0.4982892 0.5107803 0.3821928
  2  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  3  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  4  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.4246587
  5  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  6  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3821928
  7  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  8  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  9  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3821928
  10 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  11 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
  12 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> 

Docs     back    black  compani    don’t    flag’     good    howev
  1  2.866279 3.672514 3.346732 3.065685 4.788091 3.889592 3.672514
  2  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  3  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  4  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.672514
  5  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  6  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.672514
  7  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  8  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  9  0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.672514
  10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
  12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
> 
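
On the column-scaling step in the code above: the comments list a loop and a diagonal matrix, but base R can also scale columns directly with `sweep`. A minimal sketch on a hypothetical 2 x 3 matrix (the names `TF_toy` and `IDF_toy` are made up for illustration), which avoids building a terms-by-terms diagonal matrix:

```r
# Hypothetical toy inputs standing in for TF and IDF
TF_toy  <- matrix(c(1, 0, 2,
                    0, 3, 1), nrow = 2, byrow = TRUE)
IDF_toy <- c(0.5, 2, 1)

via_diag  <- TF_toy %*% diag(IDF_toy)        # approach used in the post
via_sweep <- sweep(TF_toy, 2, IDF_toy, `*`)  # multiply column j by IDF_toy[j]

# Both approaches should give the same matrix
print(all.equal(via_diag, via_sweep))
```

`sweep` also keeps the dimnames of its input, so the separate `colnames<-` step would not be needed.
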
RyanGreenup commented 4 years ago

I did a little more investigating. I think the built-in function uses base 2 rather than base e, but that still doesn't explain the tenfold discrepancy.
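
A possible explanation, offered as a sketch rather than a confirmed answer: according to its help page, `weightTfIdf` defaults to `normalize = TRUE`, which divides each term count by the total number of terms in its document, and it uses `log2` for the IDF part. The check below reuses the "black" value for document 1 from the output above. The document length of 20 terms is an assumption inferred from the numbers, not something read from the corpus.

```r
# Hand-rolled value for "black" in document 1 (from the output above),
# assuming the term occurs once there, so TF = log(1 + 1) = log(2).
manual_black <- 3.672514

# Recover N / ft from the manual weight: manual = log(2) * log(N / ft)
ratio <- exp(manual_black / log(2))

# ASSUMPTION: document 1 holds 20 terms after cleaning (inferred, not measured)
doc_len <- 20

# Weight under the assumed built-in scheme: (count / doc_len) * log2(N / ft)
builtin_black <- (1 / doc_len) * log2(ratio)
print(builtin_black)   # compare with 0.3821928 in the other table

# For a term that occurs once, the two schemes differ by log(2)^2 * doc_len
print(manual_black / builtin_black)
```

If that is the cause, calling `weightTfIdf(..., normalize = FALSE)` and switching the manual code to `log2` should bring the two much closer, although the manual TF is still `log(count + 1)` rather than the raw count.
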