dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

fcm mistake on first token? #313

Closed kbenoit closed 4 years ago

kbenoit commented 4 years ago

When I was investigating https://github.com/quanteda/quanteda/issues/1825, I noticed this:

For the text below, with a symmetric 3-token window and inverse-distance weights, the co-occurrence count of A with itself should be:

A  D  A  C  E  A  D  F  E  B  A  C  E  D
1  2  3  4  5  6  7  8  9  10 11 12 13 14

A3 to A1 is 1/2
A1 to A3 is 1/2
A6 to A3 is 1/3
A3 to A6 is 1/3
sum = 1.6667, but tcm computes half of this.
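For reference, a quick base-R check of that arithmetic (not using either package; the variable names are only for illustration):

# Hand check of the expected distance-weighted A-A count: sum 1/distance
# over all ordered pairs of "A" positions within a 3-token window of each other.
tok  <- strsplit("A D A C E A D F E B A C E D", " ")[[1]]
posA <- which(tok == "A")                      # positions 1, 3, 6, 11
pr   <- expand.grid(i = posA, j = posA)
pr   <- subset(pr, i != j & abs(i - j) <= 3)
sum(1 / abs(pr$i - pr$j))
## [1] 1.666667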

library("text2vec")
library("quanteda")

txt <- "A D A C E A D F E B A C E D"

# text2vec::tcm()

# tokenize, build the vocabulary, and create the term co-occurrence matrix
# with a symmetric 3-token skip-gram window
tokens <- txt %>%
  tolower() %>%
  word_tokenizer()
it <- itoken(tokens)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
tcm <- create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# convert to a symmetric matrix to facilitate the sorting
tcm <- as.matrix(tcm)
ttcm <- tcm
diag(ttcm) <- 0
tcm <- tcm + t(ttcm)

# sort the matrix by row and column names, then keep only the upper triangle
tcm <- tcm[order(rownames(tcm)), order(colnames(tcm))]
tcm[lower.tri(tcm, diag = FALSE)] <- 0
tcm
##           a b        c         d        e         f
## a 0.8333334 1 2.833333 3.3333333 2.833333 0.8333334
## b 0.0000000 0 0.500000 0.3333333 1.333333 0.5000000
## c 0.0000000 0 0.000000 1.3333334 2.333333 0.0000000
## d 0.0000000 0 0.000000 0.0000000 2.333333 1.0000000
## e 0.0000000 0 0.000000 0.0000000 0.000000 1.3333334
## f 0.0000000 0 0.000000 0.0000000 0.000000 0.0000000

# quanteda::fcm()

toks <- tokens(char_tolower(txt), remove_punct = TRUE)
fcm <- fcm(toks, context = "window", count = "weighted", weights = 1 / seq_len(3), window = 3)
fcm <- fcm_sort(fcm)
fcm
## Feature co-occurrence matrix of: 6 by 6 features.
##         features
## features        a b        c         d        e         f
##        a 1.666667 1 2.833333 3.3333333 2.833333 0.8333333
##        b 0        0 0.500000 0.3333333 1.333333 0.5000000
##        c 0        0 0        1.3333333 2.333333 0        
##        d 0        0 0        0         2.333333 1.0000000
##        e 0        0 0        0         0        1.3333333
##        f 0        0 0        0         0        0

🤔

dselivanov commented 4 years ago

The issue is in the following line:

diag(ttcm) <- 0

There is no need to drop the diagonal.
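If I read the fix correctly, a minimal sketch of the symmetrization without zeroing the diagonal (reusing tokens and vectorizer from the example above) would double the diagonal entry and match the quanteda value from the hand check:

# create_tcm() returns an upper-triangular matrix, so adding the transpose
# fills the lower triangle and doubles the diagonal, giving the full
# symmetric self co-occurrence count for a-a.
tcm2 <- as.matrix(create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L))
tcm2 <- tcm2 + t(tcm2)
tcm2["a", "a"]
## expected: 1.666667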