dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Unexpected results with create_tcm function #308

Closed dominiqueemmanuel closed 4 years ago

dominiqueemmanuel commented 4 years ago

Hi,

I'm not sure whether I misunderstand how a term-co-occurrence matrix is calculated or whether there is a bug in the create_tcm function.

Here is a simple example showing what seems to be an unexpected result:

library(text2vec)
txt <- c("word1 word2 word3 word4 word5 word6 word7 word8 word1 word2"
         ,"word1 word8"
         ,"word9")
it <- itoken(txt,tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L,weights = rep(1L,2),skip_grams_window_context="symmetric")

all(tcm[,"word6"]==0)
#TRUE
# ===> But I don't understand why, since "word6" appears in the context of other words...

all(tcm["word1",]==0)
#TRUE
# ===> But I don't understand why, since "word1" is surrounded by other words...

Any ideas?

Best regards, Dom

dominiqueemmanuel commented 4 years ago

However, if I use "left" and "right" for the skip_grams_window_context parameter and then add the two results, the outcome looks better:

library(text2vec)
txt <- c("word1 word2 word3 word4 word5 word6 word7 word8 word1 word2"
  ,"word1 word8"
  ,"word9")
it <- itoken(txt,tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L,weights = rep(1L,2),skip_grams_window_context="symmetric")
tcm2 <- create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1L, 2), skip_grams_window_context = "right") +
  create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1L, 2), skip_grams_window_context = "left")
colnames(tcm2) <- rownames(tcm2) <- colnames(tcm)

all(tcm[,"word6"]==0)
#TRUE
all(tcm["word1",]==0)
#TRUE

all(tcm2[,"word6"]==0)
#FALSE: BETTER :)
all(tcm2["word1",]==0)
#FALSE: BETTER :)
dominiqueemmanuel commented 4 years ago

NB: I am using version 0.5.1 of the package from CRAN.

dselivanov commented 4 years ago

With a symmetric context, the tcm is upper triangular (this saves memory, since the number of co-occurrences of word_i with word_j is the same as that of word_j with word_i). So if you need all co-occurrences of word_i, use tcm[, "word_i"] + tcm["word_i", ].
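
For the example from this issue, the same idea can be applied to the whole matrix at once. This is only a rough sketch: it assumes that adding the tcm to its transpose (via Matrix::t, from the Matrix package that text2vec builds on) is the matrix-wide equivalent of the row-plus-column sum, and the name tcm_full is just an illustration. Note that any diagonal entries would be counted twice by this sum, so treat the diagonal with care if a word can co-occur with itself inside the window.

```r
library(text2vec)

txt <- c("word1 word2 word3 word4 word5 word6 word7 word8 word1 word2",
         "word1 word8",
         "word9")
it <- itoken(txt, tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)

# Symmetric context: only the upper triangle is stored
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L,
                  weights = rep(1, 2),
                  skip_grams_window_context = "symmetric")

# Assumption: mirror the upper triangle to get a full symmetric matrix
# (tcm_full is a hypothetical name; diagonal entries get doubled here)
tcm_full <- tcm + Matrix::t(tcm)

# Co-occurrences of a single word, taken from the full matrix
word1_cooc <- tcm_full["word1", ]

# The two checks from the original report, now on the full matrix
word1_all_zero <- all(tcm_full["word1", ] == 0)
word6_all_zero <- all(tcm_full[, "word6"] == 0)
```

If only a few words are of interest, the row-plus-column sum on the original tcm avoids building the full matrix and keeps the memory saving.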

dominiqueemmanuel commented 4 years ago

Great!

It seems obvious now that you've explained it!

I'm closing the issue.

Best regards, Dominique