Closed LukasWallrich closed 4 years ago
Poking @manuelbickel
I am traveling at the moment, so just a short response; I will have a closer look at a later time. The standard way of text2vec when creating a tcm is to go through the document token by token (@dselivanov please correct me if I am wrong). Therefore, loosely speaking, in the standard case the number of virtual docs equals the number of tokens, assuming you create the tcm_ref on the basis of a single document, i.e., a character vector of length 1. (Of course, you could program different logic yourself, such as only starting windows at every, e.g., 110th token, or segmenting the document.) For some reason I used a complicated formula to express this; I can't remember why at the moment and will check again.
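To illustrate the point with toy data (a hypothetical one-sentence document, plain base R, no text2vec required): if a sliding window is started at every token of a single document, the number of virtual documents equals the token count.

```r
# Toy single document (hypothetical data, not from movie_review):
doc <- "the quick brown fox jumps over the lazy dog"

# Simple whitespace tokenizer as a stand-in for word_tokenizer():
tokens <- strsplit(tolower(doc), "\\s+")[[1]]

# One sliding window starts at each token position, so the number of
# virtual documents equals the number of tokens:
n_virtual_docs <- length(tokens)
n_virtual_docs
# -> 9
```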
After rethinking, I think I can halfway remember now. You spotted a small but decisive mistake in the documentation. Thanks for noticing! I have created a PR to correct this.
It should read tokens_ext, not just tokens. A few lines before the one you refer to, we do:
external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)
The reference_tcm created from these tokens_ext is based on multiple documents. For each document, create_tcm runs a sliding window over each word and counts co-occurrences.
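To make the sliding-window idea concrete, here is a minimal base-R sketch of counting co-occurrences within a symmetric window. Note this is not the actual text2vec internals (create_tcm by default also weights pairs by distance); plain counts are used here for simplicity, and the function name and toy tokens are made up for illustration.

```r
# Count co-occurrences of unordered token pairs within a sliding window.
# NOT text2vec internals -- a simplified illustration with plain counts.
cooc_pairs <- function(tokens, window = 2L) {
  n <- length(tokens)
  keys <- character(0)
  for (i in seq_len(n)) {
    upper <- min(i + window, n)
    if (upper > i) {
      for (j in (i + 1L):upper) {
        # Sort the pair so ("a","b") and ("b","a") count as the same pair.
        keys <- c(keys, paste(sort(c(tokens[i], tokens[j])), collapse = "|"))
      }
    }
  }
  table(keys)  # co-occurrence count per unordered token pair
}

counts <- cooc_pairs(c("a", "b", "a", "c"), window = 2L)
counts[["a|b"]]
# -> 2  ("a" and "b" co-occur at positions 1-2 and 2-3)
```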
For counting the number of sliding windows, i.e., the virtual documents of the specified length, we need to count how many tokens/words we have in total. This is the reason for the seemingly complicated expression, which should read:
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
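A quick check of this expression on toy data (a hypothetical two-document stand-in for the movie_review-based tokens_ext): it simply sums the token counts of all documents, which is equivalent to the more idiomatic sum(lengths(tokens_ext)).

```r
# Toy stand-in for tokens_ext (hypothetical data): a list of token
# vectors, one per document, as word_tokenizer() would return.
tokens_ext <- list(
  c("solar", "power", "is", "clean"),
  c("wind", "turbines", "generate", "electricity", "cheaply")
)

# The expression from the documentation: total token count across docs.
n_skip_gram_windows <- sum(sapply(tokens_ext, function(x) {length(x)}))
n_skip_gram_windows
# -> 9

# Equivalent, shorter base-R form:
sum(lengths(tokens_ext))
# -> 9
```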
I hope this sounds reasonable now (and that I got the internal logic of create_tcm right).
Side note: for a scientific study I have created a comprehensive applied code example, in the sense of a vignette, including examples for coherence calculations. It is in an interim state and not finalized yet; I have not had time to format it so that it fits into text2vec, but maybe it still helps. I hope I will find the time to polish it up.
Thanks for clarifying! That vignette also looks very helpful, I'll have a good look before finalising my ongoing study.
On Thu, 20 Feb 2020 at 17:41, Manuel Bickel notifications@github.com wrote:
Side note: For a scientific study I have created a comprehensive applied code example in the sense of a vignette https://github.com/manuelbickel/textility/blob/master/vignettes/text_mining_publication_sustainable_energy.Rmd including examples for coherence calculations https://github.com/manuelbickel/textility/blob/69a86118084007578c3ace46095be0adb7d45561/vignettes/text_mining_publication_sustainable_energy.Rmd#L1823. It is in an interim status and not finalized, yet, did not have time yet to format it in a way that it fits into text2vec, but maybe it still helps. Hope I will find the time to polish it up.
Hi there,
Maybe I am misunderstanding the concept entirely, but according to the documentation on how to create an external tcm and then calculate coherence, the n_doc_tcm parameter for the coherence() function is to be calculated like this:
That formula is independent of the window size; should that be the case? If it is meant to correspond to the number of sliding windows, that seems odd.
Thanks a lot for clarifying!