dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Minor update of coherence #258

Closed manuelbickel closed 6 years ago

manuelbickel commented 6 years ago

Hi Dmitriy, sorry for sending a PR just after you merged the first one. I realized that we lost something during the revision steps.

The default input tcm only has entries in upper.tri (and, if added manually, also in diag). In the first version of coherence I had some code to make the matrix symmetric, which was not needed for the majority of metrics, so we removed it. However, for the two indirect measures using cosim (which I only added later), this matters: since the NPMI of each top word with each other top word has to be calculated first, we need a fully symmetric tcm (or subsets thereof) with entries also in lower.tri. I have added a single line to cover this in the respective metrics (lines 336 and 360) and further clarified that the input tcm has to be an "upper.tri plus diag tcm" (line 89). I hope you agree with this logic.
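To illustrate the point, here is a minimal sketch (not the actual lines from the PR) of turning an "upper.tri plus diag" tcm into a fully symmetric one with the Matrix package; the toy values are made up:

```r
library(Matrix)

# toy tcm with entries only in upper.tri plus diag, as coherence() expects
tcm <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(4, 1, 2), dims = c(3, 3))
diag(tcm) <- c(10, 8, 5)

# mirror the upper triangle into the lower triangle;
# subtract the diagonal once so it is not counted twice
tcm_sym <- tcm + t(tcm) - Diagonal(x = diag(tcm))

isSymmetric(tcm_sym)  # TRUE: lower.tri now mirrors upper.tri
```

With entries present in lower.tri as well, the pairwise NPMI vectors used by the cosim-based metrics are complete for every top word.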

Another thing I changed is the example in the documentation regarding how the number of skip gram windows is calculated (line 164). I hope I have understood it correctly now...

Furthermore, I have fixed some typos in the documentation.

manuelbickel commented 6 years ago

I was not sure where the right place is to post interim experience with coherence, so I just post it here in the context of this PR.

In my case it seems that the "simpler" metrics npmi/pmi/difference are the informative ones regarding selection of the number of topics (how well the metrics fit human judgement is, of course, another thing to be considered separately) - see this picture, i.e., loess-smoothed coherence scores and loglik (abbreviations: ext: external corpus; int: internal corpus; ws: window size). The main corpus consists of about 30000 scientific abstracts in the field of "sustainable energy", and the reference corpus for coherence is about 2000 Wikipedia articles with potentially high thematic fit (semi-manually compiled via the WikipediR package). More complex metrics might perform better with a better reference corpus, hence I do not want to give a final judgement about the suitability of the various metrics in general.

Apart from comparing different metrics, the picture also shows the difference between the current implementation of the cosim-based metrics and the updated versions proposed in this PR. Although they are not very informative in my case, the picture shows that the updated versions are more sensitive to the data (less smooth, more peaks).

dselivanov commented 6 years ago

I'm really sorry that it takes so long to review. I will have time on Saturday or Sunday.

manuelbickel commented 6 years ago

Ouch, sorry for the wrong style; it seems it will still take some time to break the bad old habit of using `<-` before I automatically use `=` ;-). I have changed it, thank you for reviewing.