Closed manuelbickel closed 6 years ago
I was not sure where the right place is to post interim experience with coherence, so I am just posting it here in the context of this PR.
In my case it seems that the "simpler" metrics npmi/pmi/difference are the informative metrics for selecting the number of topics (how well the metrics fit human judgement is, of course, another thing to be considered separately). See this picture, which shows loess-smoothed coherence scores and loglik (abbreviations: ext: external corpus; int: internal corpus; ws: window size). The main corpus consists of about 30000 scientific abstracts in the field of "sustainable energy", and the reference corpus for coherence is about 2000 Wikipedia articles with potentially high thematic fit (semi-manually compiled via the WikipediR package). More complex metrics might perform better with a better reference corpus; hence, I do not want to give a final judgement about the suitability of the various metrics in general.
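For readers unfamiliar with the metrics named above, here is a minimal sketch of the textbook definitions of PMI and NPMI computed from co-occurrence counts. It is written in Python purely for illustration and is not the package's R implementation; the function names, the argument names, and the handling of zero co-occurrence are my own assumptions.

```python
import math

def pmi(n_ij, n_i, n_j, n_total):
    """Pointwise mutual information: log( p(i,j) / (p(i) * p(j)) ).

    n_ij    -- number of windows/documents containing both words
    n_i/n_j -- number of windows/documents containing each word
    n_total -- total number of windows/documents
    """
    if n_ij == 0:
        return float("-inf")  # the words never co-occur
    p_ij = n_ij / n_total
    p_i = n_i / n_total
    p_j = n_j / n_total
    return math.log(p_ij / (p_i * p_j))

def npmi(n_ij, n_i, n_j, n_total):
    """PMI divided by -log p(i,j), which bounds the score to [-1, 1]."""
    if n_ij == 0:
        return -1.0  # conventional limit for words that never co-occur
    if n_ij == n_total:
        return 1.0   # both words occur everywhere; avoids dividing by log(1) = 0
    p_ij = n_ij / n_total
    return pmi(n_ij, n_i, n_j, n_total) / -math.log(p_ij)

# Two words that always appear together score the maximum NPMI,
# two statistically independent words score zero.
always_together = npmi(n_ij=10, n_i=10, n_j=10, n_total=100)
independent = npmi(n_ij=10, n_i=50, n_j=20, n_total=100)
```

A topic-level coherence score would then aggregate such pairwise values over the top words of a topic; how exactly that aggregation is done differs between metrics and is not shown here.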
Apart from comparing different metrics, the picture also shows the difference between the current implementation of the metrics using cosim and the proposed updated versions in this PR. Although they are not very informative in my case, the picture shows that the updated versions are more sensitive to the data (less smooth, more peaks).
I'm really sorry that it takes so long to review. I will have time on Saturday or Sunday.
Ouch, sorry for the wrong style. It seems it will still take some time to shake off the bad old habit of using <- and until I automatically use = ;-). I have changed it, thank you for reviewing.
Hi Dmitriy, sorry for sending a PR just after you have merged the first one. I have realized that we lost something during the revision steps.
The default input tcm only has entries in upper.tri (and, if added manually, also in diag). In the first version of coherence I had some code to make the matrix symmetric, which was not needed for the majority of metrics, therefore we removed it. However, for the two indirect measures using cosim (which I only added later) this matters. Since the NPMI of each top word with each other top word has to be calculated first, we need a fully symmetric tcm (or subsets thereof) with entries also in lower.tri. I have added a single line to cover this in the respective metrics (lines 336 and 360) and further clarified that the input tcm has to be an "upper.tri plus diag tcm" (line 89). I hope you agree with this logic.
Another thing I changed is in the example in the documentation regarding how the number of skip-gram windows is calculated (line 164). I hope I have understood it correctly now...
Furthermore, I have fixed some typos in the documentation.
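To illustrate the symmetry issue described above, here is a minimal sketch of why an "upper.tri plus diag" matrix is not enough when each word needs its full vector of scores against every other word. It uses plain Python lists purely for illustration; the helper name `symmetrize` and the toy counts are hypothetical and do not reproduce the one-line fix in the PR.

```python
def symmetrize(upper):
    """Mirror the strict upper triangle of a square matrix into the lower one.

    The diagonal is left untouched; the input is not modified.
    """
    n = len(upper)
    full = [row[:] for row in upper]  # copy each row
    for i in range(n):
        for j in range(i + 1, n):
            full[j][i] = upper[i][j]
    return full

# Toy term co-occurrence matrix for three top words, stored the way the
# default input is described: entries only in the upper triangle and diagonal.
tcm_upper = [
    [5, 2, 1],
    [0, 4, 3],
    [0, 0, 6],
]
tcm_full = symmetrize(tcm_upper)

# Row of the third word before and after mirroring: in the triangular form
# its co-occurrences with the first two words are missing.
row_before = tcm_upper[2]  # [0, 0, 6]
row_after = tcm_full[2]    # [1, 3, 6]
```

Metrics that only look up a single pair (i, j) with i < j are unaffected, which is consistent with the symmetrization having been unnecessary for the majority of metrics; a row-wise comparison such as cosine similarity would see incomplete vectors without it.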