dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Topic modeling guide #262

Open dselivanov opened 6 years ago

dselivanov commented 6 years ago

It would be useful to create a comprehensive, practical guide to topic modeling. We now have all the components in place:

Steps

The udpipe package already has good vignettes on topic modeling and phrase extraction. They can be used as inspiration.

@manuelbickel @jwijffels anything we can add to the plan above?

sjankin commented 6 years ago

Regarding a non-trivial corpus with a large number of documents, how about the UN General Debate corpus? It's publicly available from Harvard Dataverse as "UNGDC 1970-2017.zip". Direct link here. It covers country statements in the UN General Debate (by presidents, prime ministers, etc.), delivered once per year at the opening of each UN session from 1970 to 2017, for a total of 7,897 speeches.

manuelbickel commented 6 years ago

We might add some aspects regarding downstream analysis (and maybe visualization depending on the target audience or format of publication).

Regarding downstream analysis we might do (feel free to change/adapt/add):

jwijffels commented 6 years ago

Nice points. I have some time from May 13 onwards to work on this. I would be interested in having a corpus that has the same text in several languages, to show that the flow works for all languages with limited manual intervention. In Belgium we have some open data (20,000 records, if I recall) covering all questions and answers in parliament over the last few years, but it is only in Dutch and French. It would be nice to have a corpus that also includes English plus some more languages (maybe europarl?). FYI, I've also added more docs on multi-word phrase extraction at https://bnosac.github.io/udpipe/docs/doc7.html