dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

coherence documentation #328

Closed marcburri closed 3 years ago

marcburri commented 3 years ago

Hi,

I was wondering whether the documentation for the coherence() function should be adjusted.

Namely, the following line, where you explain the steps to create a TCM for extrinsic measures from an external corpus,

# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))

should look like this

# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {max(c(length(x)-window_size+1,1))}))

since a document of, say, 111 tokens would give us 2 virtual documents with window_size = 110, and not 111 as the current version suggests.

manuelbickel commented 3 years ago

Hi, thank you for your interest in the details. I needed some time myself to understand how co-occurrence is counted in text2vec. The package uses a sliding window that moves over the text, so a window is created for each token in a given text. The number of windows therefore equals the number of tokens, which is what the current documentation line computes. There is a general discussion of this logic in issue #253 (https://github.com/dselivanov/text2vec/issues/253) that might help. I hope this helps and you can follow the logic.
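
To make the difference between the two counts concrete, here is a minimal base-R sketch. It does not call text2vec; the toy `tokens_ext`, the helper `count_windows()`, and the exact span of each window are assumptions made for illustration only. It only shows that starting one window at every token (truncated at the end of the document) yields as many windows as there are tokens, whereas counting only windows that fit completely yields far fewer.

```r
# Toy stand-in for tokens_ext (hypothetical data, not from the package docs)
tokens_ext = list(
  paste0("t", 1:111),  # a document with 111 tokens
  paste0("t", 1:5)     # a short document with 5 tokens
)
window_size = 110

# Count from the current documentation: one window per token
n_per_token = sum(sapply(tokens_ext, function(x) length(x)))

# Count proposed in this issue: only windows that fit entirely in a document
n_full_only = sum(sapply(tokens_ext, function(x) max(length(x) - window_size + 1, 1)))

# Hypothetical helper: start a window at every token and cut it off at the
# end of the document (assumed span; the count does not depend on it)
count_windows = function(x, window_size) {
  n = length(x)
  windows = lapply(seq_len(n), function(i) x[i:min(i + window_size, n)])
  length(windows)
}
n_enumerated = sum(sapply(tokens_ext, count_windows, window_size = window_size))

print(c(per_token = n_per_token, full_only = n_full_only, enumerated = n_enumerated))
```

If text2vec indeed opens one window per token as described above, the enumerated count agrees with the per-token count, which would be the value to pass as n_doc_tcm.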

marcburri commented 3 years ago

Thank you, everything clear now.
