dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Calculation of n_skip_gram_windows independent of window size? #316

Closed LukasWallrich closed 4 years ago

LukasWallrich commented 4 years ago

Hi there,

Maybe I am misunderstanding the concept entirely, but in the documentation on creating an external tcm to then calculate coherence, the n_doc_tcm argument for the coherence() function is calculated like this:

# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens, function(x) {length(x)}))

That formula is independent of the window size - should that be the case? If it is supposed to correspond to the number of sliding windows, that seems odd.

Thanks a lot for clarifying!

dselivanov commented 4 years ago

Poking @manuelbickel

manuelbickel commented 4 years ago

I am traveling at the moment, so just a short response; I will have a closer look at a later time. The standard approach of text2vec when creating a tcm is to go through the document token by token (@dselivanov, please correct me if I am wrong). Therefore, loosely speaking, in the standard case the number of virtual docs, i.e., segments, equals the number of tokens, assuming you create the tcm_ref on the basis of a single document, i.e., a character vector of length 1. (Of course, you could program different logic yourself, such as only starting windows at every, e.g., 110th token, or segmenting the document.) For some reason I used a complicated formula to express this; I can't remember why at the moment and will check again.

manuelbickel commented 4 years ago

After rethinking, I think I can halfway remember now. You spotted a small but decisive mistake in the documentation. Thanks for noticing! I have created a PR to correct this.

It should read tokens_ext, not only tokens. Several lines before the one you refer to, we do:

external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)

The reference_tcm created from these tokens_ext is based on multiple documents. For each document, create_tcm slides a window over each word and counts co-occurrences.
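As a rough illustration of this window logic, here is a toy base-R sketch (not the actual text2vec implementation, which also applies distance weights): a symmetric window is centered on every token position, so each position spawns one "virtual document", and the number of windows equals the total number of tokens.

```r
# Toy sketch (assumption: symmetric, unweighted windows; real create_tcm differs).
# Counts directed co-occurrence pairs within a window around every token position.
count_cooccurrences <- function(tokens, window = 5L) {
  pairs <- character(0)
  n_windows <- 0L
  for (doc in tokens) {
    n <- length(doc)
    for (i in seq_len(n)) {
      n_windows <- n_windows + 1L  # one virtual document per token position
      ctx <- setdiff(max(1L, i - window):min(n, i + window), i)
      pairs <- c(pairs, paste(doc[i], doc[ctx], sep = "|"))
    }
  }
  list(n_windows = n_windows, counts = table(pairs))
}

toy <- list(c("a", "b", "c"), c("b", "c"))
res <- count_cooccurrences(toy, window = 2L)
res$n_windows  # 5, i.e. the total number of tokens across both documents
```

This is only meant to show why the window count is independent of the window size: the window size changes which co-occurrences are counted, not how many windows there are.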

To count the number of sliding windows, i.e., the virtual documents of the specified length, we need to count how many tokens/words we have in total. This is the reason for the seemingly complicated expression, which should read:

n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
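To make the counting concrete, here is a small base-R check with a toy stand-in for tokens_ext (the movie_review data is not loaded here); note that sum(lengths(x)) is an equivalent, shorter base-R idiom:

```r
# Toy stand-in for tokens_ext: a list with one character vector per document.
tokens_ext_toy <- list(
  c("good", "movie", "plot"),
  c("bad", "acting")
)

# Formula from the documentation: total number of tokens across all documents,
# i.e. one sliding window (virtual document) per token.
n_skip_gram_windows <- sum(sapply(tokens_ext_toy, function(x) {length(x)}))
n_skip_gram_windows  # 5

# Equivalent, shorter idiom:
stopifnot(n_skip_gram_windows == sum(lengths(tokens_ext_toy)))
```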

I hope this sounds reasonable now (and that I got the internal logic of create_tcm right).

manuelbickel commented 4 years ago

Side note: for a scientific study I have created a comprehensive applied code example in the sense of a vignette (https://github.com/manuelbickel/textility/blob/master/vignettes/text_mining_publication_sustainable_energy.Rmd), including examples of coherence calculations (https://github.com/manuelbickel/textility/blob/69a86118084007578c3ace46095be0adb7d45561/vignettes/text_mining_publication_sustainable_energy.Rmd#L1823). It is still in an interim state and not finalized; I have not yet had time to format it so that it fits into text2vec, but maybe it helps anyway. I hope I will find the time to polish it up.

LukasWallrich commented 4 years ago

Thanks for clarifying! That vignette also looks very helpful, I'll have a good look before finalising my ongoing study.
