dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Question: How to calculate number of skip gram windows (or logic of co-occurrence with skip-grams)? #253

Closed: manuelbickel closed this issue 5 years ago

manuelbickel commented 6 years ago

Hi Dmitry,

with reference to the implementation of coherence metrics (#252, formerly #241), some metrics use term probabilities instead of counts. We therefore need the number of skip-gram windows, which represents the number of "virtual documents" from which the counts are calculated, so that it can be passed as n_doc_tcm to the current version of the coherence function. Due to my lack of knowledge of C++, I did not fully understand the code that generates the tcm and could, thus, not correctly calculate this number. Below are two approaches that I tried, along with a rough explanation of my intuitive understanding of how co-occurrence counting is done (which is apparently all wrong). Maybe you could find some time to give me a hint on how counting co-occurrences with sliding windows works, or provide a function that counts the number of windows, from which I could update my understanding? Thanks in advance.
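For context, here is a minimal sketch of where that number would be plugged in, assuming the coherence() argument names discussed in #252; top_terms and n_windows are placeholders, not computed in this issue:

# hypothetical call: the number of skip-gram windows is passed as the
# number of "virtual documents" underlying the tcm counts
# (top_terms would be a matrix of top topic terms, n_windows the window count)
coherence(x = top_terms, tcm = tcm, n_doc_tcm = n_windows)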

library(text2vec)

tokens = word_tokenizer("a b c b a x x a b c b a")
it = itoken(tokens)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
window_size = 2
# a fresh iterator is created here, since `it` has been consumed by create_vocabulary()
tcm = create_tcm(itoken(tokens), vectorizer,
                 skip_grams_window = window_size,
                 weights = rep(1, window_size),
                 binary_cooccurence = FALSE,
                 skip_grams_window_context = "symmetric")
tcm
#   c x a b
# c . . 4 4
# x . 1 4 2
# a . . . 4
# b . . . 2

#considering a sliding window that jumps from each token to the next,
#my intuitive understanding is a segmentation similar to the one below
#for getting the co-occurrence counts / number of windows
#(incomplete windows at the edges might be handled differently):
#take the first available full window (starting position would be the "c"):
#"a b c b a"  # cooccur(a,b) = 2
#then slide through each token:
#"b c b a x"  # cooccur(a,b) = 1
#"c b a x x"  # since only context tokens are used for counting, cooccur(a,b) = 0 here
#"b a x x a"  # cooccur(a,b) = 1
#"a x x a b"  # cooccur(a,b) = 1
#"x x a b c"  # again only context tokens, cooccur(a,b) = 0
#"x a b c b"  # cooccur(a,b) = 1
#"a b c b a"  # cooccur(a,b) = 2
#This would result in higher co-occurrence values than in the tcm above...
#hence, my understanding must be wrong...

#To calculate the number of skip-gram windows, the following approach is, thus, wrong...
get_n_skip_gram_windows_v1 = function(tokens, window_size) {
  sum(sapply(tokens, function(x) {
    #first window
    n_windows = 1
    #additional windows gained by sliding to the right
    add = length(x) - (1 + window_size * 2) #symmetric context, therefore times 2
    if (add > 0) {
      n_windows = n_windows + add
    }
    return(n_windows)
  }))
}

get_n_skip_gram_windows_v1(tokens = tokens, window_size = window_size)
# 8

#also the following alternative approach does not seem to be correct...
#this would resemble splitting the tokens into full windows just once
get_n_skip_gram_windows_v2 = function(tokens, window_size) {
  sum(sapply(tokens, function(x) ceiling(length(x) / (window_size * 2)))) #symmetric context, therefore times 2
}

get_n_skip_gram_windows_v2(tokens = tokens, window_size = window_size)
# 3
dselivanov commented 6 years ago

I can easily add a counter of windows to the C++ code. In your example above you use window_size = 2, but in your comments you use a window of length 4 (two context tokens on each side of the central word). Also note that when the context is symmetric, we count cooc(a, b) = cooc(b, a). So in the example above the windows and cumulative counts look like this (the "central" word comes first in each row):

"a b c b a x x a b c b a"
a b c 1
b c b 1
c b a 1
b a x 2
a x x 2
x x a 2
x a b 2
a b c 3
b c b 3
c b a 3
b a _ 4
a _ _ 4
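To make the mechanics above concrete, here is a minimal pure-R sketch of that counting scheme as I read it from the listing (not the actual C++ implementation; count_cooc is a made-up helper name). Each token is the central word of exactly one window, its context is the following window_size tokens, and symmetric pairs are folded into a single count. With weights = rep(1, window_size) this reproduces the tcm from the first comment:

# sketch of the counting scheme read from the listing above (not the
# actual C++ code): each token is the central word of exactly one window,
# the context is the following `window_size` tokens
count_cooc = function(tokens, window_size) {
  n = length(tokens)
  pairs = character(0)
  for (i in seq_len(n)) {
    # context indices, truncated at the end of the document
    j = i + seq_len(min(window_size, n - i))
    for (w in tokens[j]) {
      # sort each pair so that cooc(a, b) and cooc(b, a) fold into one count
      pairs = c(pairs, paste(sort(c(tokens[i], w)), collapse = "_"))
    }
  }
  table(pairs)
}

tokens_flat = strsplit("a b c b a x x a b c b a", " ")[[1]]
cooc = count_cooc(tokens_flat, window_size = 2)
cooc["a_b"]  # matches the a/b cell of the tcm above
# 4

A direct consequence is that each token contributes exactly one window (12 windows in the listing for 12 tokens), so counting the number of skip-gram windows per document would reduce to its token count:

length(tokens_flat)
# 12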