dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

tcm (by `create_tcm`) is not documented. #340

Open otoomet opened 1 year ago

otoomet commented 1 year ago

I am puzzled about what exactly a TCM (term co-occurrence matrix) is. The documentation of `create_tcm` only says that

This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.

and that its value is

dgTMatrix TCM matrix

Pennington, Socher and Manning, when introducing GloVe, define

matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$

My reading is that this matrix should be symmetric, i.e. $X_{ij} = X_{ji}$, if the context is symmetric and the weights are 1. However, consider a very simple example with a window of 1:

library(text2vec)

doc <- c("a b c b a")
it <- itoken(doc)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# Symmetric window of one word on each side, every neighbour weighted 1
tcm <- create_tcm(it,
                  vectorizer,
                  skip_grams_window = 1,
                  skip_grams_window_context = "symmetric",
                  weights = 1)
tcm

This results in

3 x 3 sparse Matrix of class "dgTMatrix"
  c a b
c . . 2
a . . 2
b . . .

This is clearly not symmetric, e.g. there is no context for word "b". The rest makes sense: "c" has two "b"s as context, and so does "a".
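For comparison, here is a minimal base-R sketch of the counts I would expect under the GloVe definition quoted above, assuming a symmetric window of one word and a weight of 1 for every neighbour. It does not use text2vec at all; the names `tokens`, `terms` and `X` are just for illustration.

```r
# Count co-occurrences by hand for a symmetric window of size 1,
# giving every neighbour a weight of 1.
tokens <- strsplit("a b c b a", " ", fixed = TRUE)[[1]]
terms <- c("c", "a", "b")  # same term order as in the printed TCM
X <- matrix(0, nrow = 3, ncol = 3, dimnames = list(terms, terms))

for (i in seq_along(tokens)) {
  # Positions one step to the left and right that fall inside the document
  neighbours <- intersect(c(i - 1, i + 1), seq_along(tokens))
  for (j in neighbours) {
    # Row = the word itself, column = the word found in its context
    X[tokens[i], tokens[j]] <- X[tokens[i], tokens[j]] + 1
  }
}

X
isSymmetric(X)
```

In this hand count "b" does have context (two "a"s and two "c"s), and the upper triangle of `X` agrees with the matrix printed above.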

Does the returned TCM only fill in the upper triangle? The documentation for `coherence` seems to confirm this.

I am happy to contribute PRs and such, but would like to hear from you before I do so.

dselivanov commented 1 year ago

Hi, yes, the TCM is symmetric, so we keep only the upper triangle to save memory. A PR to update the docs is appreciated.
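If the full symmetric matrix is needed, one possible way to recover it is sketched below. This is a sketch under assumptions, not something text2vec itself documents: it uses `forceSymmetric()` from the Matrix package, and the upper-triangular matrix is rebuilt by hand from the output shown in the issue (the name `tcm_upper` is illustrative) so that the snippet runs without text2vec.

```r
library(Matrix)

# Upper-triangular TCM copied from the printed output (terms: c, a, b).
terms <- c("c", "a", "b")
tcm_upper <- sparseMatrix(
  i = c(1, 2), j = c(3, 3), x = c(2, 2),
  dims = c(3, 3), dimnames = list(terms, terms)
)

# Mirror the stored upper triangle into the lower triangle.
# Diagonal entries are kept as stored; whether that matches the
# convention you need for self co-occurrences is an assumption.
tcm_full <- forceSymmetric(tcm_upper, uplo = "U")

isSymmetric(tcm_full)
tcm_full["b", "a"]  # context counts for "b" are now visible
```

The result is stored as a symmetric sparse matrix, so it still keeps only one triangle in memory while behaving like the full matrix when indexed.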