Closed dominiqueemmanuel closed 4 years ago
However, if I use "left" and "right" for the skip_grams_window_context parameter and then add the two results, the outcome looks better:
library(text2vec)

txt <- c(
  "word1 word2 word3 word4 word5 word6 word7 word8 word1 word2",
  "word1 word8",
  "word9"
)
it <- itoken(txt, tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)

# Symmetric context in a single call
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1L, 2),
                  skip_grams_window_context = "symmetric")

# Sum of the "right" and "left" contexts
tcm2 <- create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1L, 2),
                   skip_grams_window_context = "right") +
  create_tcm(it, vectorizer, skip_grams_window = 2L, weights = rep(1L, 2),
             skip_grams_window_context = "left")
colnames(tcm2) <- rownames(tcm2) <- colnames(tcm)

all(tcm[, "word6"] == 0)
# TRUE
all(tcm["word1", ] == 0)
# TRUE
all(tcm2[, "word6"] == 0)
# FALSE: BETTER :)
all(tcm2["word1", ] == 0)
# FALSE: BETTER :)
NB: I am using version 0.5.1 of the package from CRAN.
When using a symmetric context, tcm is upper triangular (this saves memory, since the number of co-occurrences of word_i and word_j is the same as that of word_j and word_i). So if you need all co-occurrences of word_i, use tcm[, "word_i"] + tcm["word_i", ].
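To make the column-plus-row idea concrete, here is a minimal sketch in base R. It does not call text2vec: the small dense matrix and its counts are made up for illustration and only stand in for an upper-triangular TCM. With the sparse matrix returned by create_tcm, the same expressions should apply, assuming the Matrix package methods for t() and indexing are available.

```r
# Hypothetical stand-in for a symmetric-context TCM:
# only the upper triangle is filled, as described above.
terms <- c("word1", "word2", "word3")
tcm <- matrix(0, nrow = 3, ncol = 3, dimnames = list(terms, terms))
tcm["word1", "word2"] <- 2  # word1 and word2 co-occur twice
tcm["word1", "word3"] <- 1
tcm["word2", "word3"] <- 1

# All co-occurrences of one term: its column plus its row.
cooc_word2 <- tcm[, "word2"] + tcm["word2", ]

# The same for every term at once: add the transpose to get a full
# symmetric matrix.
tcm_full <- tcm + t(tcm)
```

Here cooc_word2 equals the word2 column of tcm_full. Note that both forms count a diagonal entry (a word co-occurring with itself) twice, so handle the diagonal separately if that matters for your use.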
Great!
It seems obvious now you've explained it!
I'm closing the issue.
Best regards, Dominique
Hi,
I wonder whether I misunderstand how a term co-occurrence matrix is calculated, or whether there is a bug in the create_tcm function. Here is a simple example showing what seems to be an unexpected result.
Any ideas?
Best regards, Dom