dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Question: How to calculate number of skip gram windows (or logic of co-occurrence with skip-grams)? #253

Closed: manuelbickel closed this issue 5 years ago

manuelbickel commented 6 years ago

Hi Dmitry,

with reference to the implementation of coherence metrics (#252, formerly #241), some metrics use term probabilities instead of counts. We therefore need the number of skip-gram windows, which represents the number of "virtual documents" from which the counts are calculated, so that it can be passed as n_doc_tcm to the current version of the coherence function. Due to my lack of knowledge of C++, I did not fully understand the code that generates the tcm and could, thus, not correctly calculate this number. Below are two approaches that I tried, along with a rough explanation of my intuitive understanding of how co-occurrence counting is done (which is apparently all wrong). Maybe you could find some time to give me a hint on how counting co-occurrences with sliding windows works, or provide a function that counts the number of windows, from which I could update my understanding? Thanks in advance.
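For context, here is a minimal sketch of where that number would be plugged in, assuming the coherence() argument names discussed in #252; top_terms and n_windows are placeholders, not computed in this issue:

# hypothetical call: the number of skip-gram windows is passed as the
# number of "virtual documents" underlying the tcm counts
# (top_terms would be a matrix of top topic terms, n_windows the window count)
coherence(x = top_terms, tcm = tcm, n_doc_tcm = n_windows)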

library(text2vec)

tokens = word_tokenizer("a b c b a x x a b c b a")
it = itoken(tokens)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
window_size = 2
# a fresh iterator is created here, since `it` has been consumed by create_vocabulary()
tcm = create_tcm(itoken(tokens), vectorizer,
                 skip_grams_window = window_size,
                 weights = rep(1, window_size),
                 binary_cooccurence = FALSE,
                 skip_grams_window_context = "symmetric")
tcm
#   c x a b
# c . . 4 4
# x . 1 4 2
# a . . . 4
# b . . . 2

#considering a sliding window that jumps from each token to the next,
#my intuitive understanding is a segmentation similar to the one below
#for getting the co-occurrence counts / number of windows
#(incomplete windows at the edges might be handled differently):
#take the first available full window (starting position would be the "c"):
#"a b c b a"  # cooccur(a,b) = 2
#then slide through each token:
#"b c b a x"  # cooccur(a,b) = 1
#"c b a x x"  # since only context tokens are used for counting, cooccur(a,b) = 0 here
#"b a x x a"  # cooccur(a,b) = 1
#"a x x a b"  # cooccur(a,b) = 1
#"x x a b c"  # again only context tokens, cooccur(a,b) = 0
#"x a b c b"  # cooccur(a,b) = 1
#"a b c b a"  # cooccur(a,b) = 2
#This would result in higher co-occurrence values than in the tcm above...
#hence, my understanding must be wrong...

#To calculate the number of skip-gram windows, the following approach is, thus, wrong...
get_n_skip_gram_windows_v1 = function(tokens, window_size) {
  sum(sapply(tokens, function(x) {
    #first window
    n_windows = 1
    #additional windows gained by sliding to the right
    add = length(x) - (1 + window_size * 2) #symmetric context, therefore times 2
    if (add > 0) {
      n_windows = n_windows + add
    }
    return(n_windows)
  }))
}

get_n_skip_gram_windows_v1(tokens = tokens, window_size = window_size)
# 8

#also the following alternative approach does not seem to be correct...
#this would resemble splitting the tokens into full windows just once
get_n_skip_gram_windows_v2 = function(tokens, window_size) {
  sum(sapply(tokens, function(x) ceiling(length(x) / (window_size * 2)))) #symmetric context, therefore times 2
}

get_n_skip_gram_windows_v2(tokens = tokens, window_size = window_size)
# 3
dselivanov commented 6 years ago

I can easily add a counter of windows to the C++ code. In your example above you use window_size = 2, but in your comments you use a window of length 4 (two context tokens on each side of the central word). Also note that when the context is symmetric, we count cooc(a, b) = cooc(b, a). So in the example above the windows and cumulative counts look like this (the "central" word comes first in each row):

"a b c b a x x a b c b a"
a b c 1
b c b 1
c b a 1
b a x 2
a x x 2
x x a 2
x a b 2
a b c 3
b c b 3
c b a 3
b a _ 4
a _ _ 4
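To make the mechanics above concrete, here is a minimal pure-R sketch of that counting scheme as I read it from the listing (not the actual C++ implementation; count_cooc is a made-up helper name). Each token is the central word of exactly one window, its context is the following window_size tokens, and symmetric pairs are folded into a single count. With weights = rep(1, window_size) this reproduces the tcm from the first comment:

# sketch of the counting scheme read from the listing above (not the
# actual C++ code): each token is the central word of exactly one window,
# the context is the following `window_size` tokens
count_cooc = function(tokens, window_size) {
  n = length(tokens)
  pairs = character(0)
  for (i in seq_len(n)) {
    # context indices, truncated at the end of the document
    j = i + seq_len(min(window_size, n - i))
    for (w in tokens[j]) {
      # sort each pair so that cooc(a, b) and cooc(b, a) fold into one count
      pairs = c(pairs, paste(sort(c(tokens[i], w)), collapse = "_"))
    }
  }
  table(pairs)
}

tokens_flat = strsplit("a b c b a x x a b c b a", " ")[[1]]
cooc = count_cooc(tokens_flat, window_size = 2)
cooc["a_b"]  # matches the a/b cell of the tcm above
# 4

A direct consequence is that each token contributes exactly one window (12 windows in the listing for 12 tokens), so counting the number of skip-gram windows per document would reduce to its token count:

length(tokens_flat)
# 12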