Hi Manuel. `CalcProbCoherence` does not implement the measure proposed by Mimno et al. (http://dirichlet.net/pdf/mimno11optimizing.pdf). Instead, it is a measure that I developed. (I haven't yet written it up, but it will be part of my PhD dissertation. That'll be sometime in the next two years.)
Mimno's measure suffers from promoting topics full of words that are very frequent but statistically independent of each other. For example, suppose you have a corpus of articles from the sports section of a newspaper. A topic with the words {sport, sports, ball, fan, athlete} would look great under Mimno's measure. But we actually know it's a terrible topic, because those words are so frequent in this corpus as to be meaningless. In other words, they co-occur frequently (which is what Mimno's measure captures), but they are statistically independent of each other.
Probabilistic coherence corrects for this. For each pair of words {a, b} in the top M words in a topic, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic.
Here's the logic: if we restrict our search to only documents that contain the word {a}, then the word {b} should be more probable in those documents than if chosen at random from the corpus. P(b|a) measures how probable {b} is in documents containing {a}; P(b) measures how probable {b} is in the corpus as a whole. If {b} is no more probable in documents containing {a} than elsewhere, then the difference P(b|a) - P(b) should be close to zero.
The lines of code you highlighted are doing this calculation across the top M words. For example, suppose the top 4 words in a topic, ordered from most to least probable, are {a, b, c, d}. Then we calculate

P(b|a) - P(b), P(c|a) - P(c), P(d|a) - P(d)
P(c|b) - P(c), P(d|b) - P(d)
P(d|c) - P(d)

All 6 differences are averaged together, giving the probabilistic coherence measure.
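If it helps to see it in code, here is a minimal base-R sketch of that calculation (an illustration only, not the actual CalcProbCoherence implementation; `dtm` and `top_terms` are hypothetical inputs):

```r
# Minimal sketch of probabilistic coherence for one topic.
# Assumes `dtm` is a document-term matrix (documents x terms, with column
# names) and `top_terms` holds the topic's top M words, ordered from most
# to least probable. Illustration only, not the textmineR implementation.
prob_coherence <- function(dtm, top_terms) {
  m <- (as.matrix(dtm[, top_terms]) > 0) * 1  # binary: does doc contain word?
  p <- colSums(m) / nrow(m)                   # P(word) in the corpus
  diffs <- c()
  for (i in 1:(length(top_terms) - 1)) {
    for (j in (i + 1):length(top_terms)) {
      docs_a <- m[, i] == 1                   # documents containing word a
      p_b_given_a <- sum(m[docs_a, j]) / sum(docs_a)
      diffs <- c(diffs, p_b_given_a - p[j])   # P(b|a) - P(b)
    }
  }
  mean(diffs)                                 # average over all pairs
}
```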
Hope that helps. Let me know if you need me to explain anything more. Fun fact: I've run simulations and found that probabilistic coherence is great at selecting the optimal number of topics in a corpus. And, yes, I need to write all this up in a research paper.
Hi Tommy,
thank you for your detailed answer and explanations about your new approach to topic coherence. I think I got the main point about measuring statistical dependence instead of only correlation. Please correct me if I am wrong, but it follows a line of thinking similar to the pointwise mutual information (PMI) measure for collocations, of course with a wider context and more complexity.
Maybe I forgot to mention that my project is also part of a PhD thesis. Hence, I am planning to write a research paper focused on the content of papers in the field of "energy" (not on the methodology of text mining). Therefore, I wanted to know whether your approach is documented anywhere other than in your package (and now in this thread) that might serve as a potential source for citation. I know you already said "yes, I need to write all this up in a paper", but I wanted to ask anyway. Otherwise reviewers will simply have to check your code ;-).
Just realized that my comment concerning PMI was rather uninformed: after reading some more, I see that most coherence measures make use of the basic PMI concept but differ in how they use it. Sorry for asking before reading.
Don't be so hard on yourself. But, yes, PMI is getting at the same thing; it uses the same probabilities, i.e. PMI = log(P(b|a) / P(b)). If {a} and {b} are statistically independent, then PMI will be zero. I prefer my measure because it is bounded between -1 and 1, whereas PMI ranges from negative infinity to infinity. So, IMHO, probabilistic coherence is easier to interpret and compare across contexts.
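A quick toy illustration of that difference in behavior (made-up probabilities):

```r
# Toy comparison of PMI vs. the probabilistic coherence difference,
# with P(b) fixed at 0.1 and P(b|a) varying from independence to
# perfect association. Made-up numbers, for illustration only.
p_b <- 0.1
for (p_b_given_a in c(0.1, 0.5, 1.0)) {
  pmi  <- log(p_b_given_a / p_b)  # 0 under independence, unbounded above
  diff <- p_b_given_a - p_b       # 0 under independence, bounded in [-1, 1]
  cat(sprintf("P(b|a) = %.1f   PMI = %.2f   difference = %.2f\n",
              p_b_given_a, pmi, diff))
}
```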
And, unfortunately, probabilistic coherence isn't documented anywhere. You can cite the textmineR package itself, however. (Hell, be an academic rebel and cite this thread!)
I sat down to write up probabilistic coherence a few years ago. I quickly realized that, lacking a statistical theory, there was no global and objective way to say one coherence measure is better than another. So, I set about solving that problem, and the coherence paper never got written. (That statistical theory will be the bulk of my dissertation.)
Thanks for the quick and nice answer. Since the largest amount of time is consumed by fitting models for different numbers of topics, I think it might be worth calculating, along the way, all of the simple intrinsic coherence measures that I can implement (x-fold perplexity cross-validation, UMass, and your probabilistic coherence) and also saving some of the topic word lists to check which of the measures points to the most reasonable topics - I guess yours. On this basis, I can validate the measure (not statistically, but on the basis of expertise in my field) and cite textmineR and this thread. I will let you know about the results. Thank you again for your time and explanations.
Sounds good. I'll close the issue for now. If anything else comes up, please just comment here and we can re-open. It'll all still be online for citation.
Hi Tommy & Manuel, interesting discussion. I am also a PhD student, in the Information Systems (IS) stream, and I am looking for an intrinsic topic coherence measure for part of my data analysis. Therefore, I am looking for an R package (or implementation) for UMass. I came across probabilistic coherence via the textmineR package. Have either of you documented the probabilistic coherence measure by now? If so, I'd appreciate it if you could share the reference. Thanks!
Hi, we have just implemented several coherence metrics in the text2vec package (dev version), incl. UMass. You will also find the latter implemented in the stm package as "semanticCoherence". We have also included the difference measure proposed here in textmineR, however, in a vectorized form (might be interesting for Tommy from a programming perspective).
From a theoretical perspective you might be interested in the paper by Röder et al., "Exploring the Space of Topic Coherence Measures", which sums up and compares various metrics. The authors have implemented a very wide range of metrics in the Java program Palmetto (you will find it on GitHub).
Best regards!
Thanks Manuel. I already had a look at the stm package's 'semanticCoherence' function. I will have a look at the text2vec package. Thanks again for the very prompt reply.
Hi, guys. @sweetmals, to answer your question: probabilistic coherence is now documented in one of the vignettes. GitHub version here: https://github.com/TommyJones/textmineR/blob/master/vignettes/c_topic_modeling.Rmd CRAN version here: https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html
However, I messed up the probabilities in the description. (Instead of "P(a|b) - P(b)", it should read "P(b|a) - P(b)", for example.) I've opened the issue here: #38
@manuelbickel just implemented a bunch of coherence measures in text2vec. He also cited a paper comparing various topic coherence measures that's worth checking out: https://pdfs.semanticscholar.org/03a0/62fdcd13c9287a2d4e1d6d057fd2e083281c.pdf
In their comparison, the UMass measure performs poorly. They also use a measure that seems to be identical to probabilistic coherence. (Looks like their citation was 2007; I independently derived probabilistic coherence in 2013. So, I guess it's theirs. :-/)
Anyway, I hope this is helpful. I'll get to #38 this summer.
Hi Tommy, that's great that it's now documented. Thanks a lot for the information. It is indeed helpful, as I am new to this area.
As an addition to the information Tommy has provided, here is the link to a version of the Röder et al. paper that includes some more detailed information on the metrics used: https://doi.org/10.1145/2684822.2685324. Furthermore, I have learned that the idea of coherence and several related metrics were already discussed much earlier at a more abstract level, not in direct connection to text mining, e.g., by Eells and Fitelson in "Symmetries and Asymmetries in Evidential Support" and by several other authors before them (the Röder paper also refers to some of these authors; check the references).
I agree with Tommy that UMass performs poorly. I have only tested it with one dataset so far, but it does not seem to be a useful measure, at least not for finding the optimum number of topics (maybe an adapted version might do better)... just as a side note.
Furthermore, I want to highlight that without the efforts of @TommyJones - thanks! - in implementing the probabilistic difference metric, I would not have been able to implement the other measures. His implementation really helped me understand how such metrics can be programmed in R - I hope there won't turn out to be too many mistakes ;-). Also, his explanations are very straightforward and not cryptic, so that especially beginners in the field can understand the general idea quite quickly.
Concerning text2vec, the implemented difference and UMass metrics produce the same results as the textmineR and stm packages, respectively. I think for the other metrics there is no other implementation in R yet for cross-checking.
Thank you so much for the support. You guys are so nice :-) Have either of you tested topic coherence measures on short texts (or tweets), and do you have any interesting insights or findings to share? I am working on a Twitter data set, so I thought of asking, as you two seem to be experts in the area. I read two papers on that topic, "Topics in Tweets: A User Study of Topic Coherence Metrics for Twitter Data" and "Examining the Coherence of the Top Ranked Tweet Topics", but these papers mention different measures. Again, thanks a lot for the information and help. I very much appreciate it.
My experience in general is that tweets (and other short texts) are inherently noisy. The result is models with low R-squared (also in textmineR) and lower-coherence topics, on average. But that's only my experience, not hard research.
Thanks Tommy.
Hi Tommy (@TommyJones ) & Manuel (@manuelbickel )
Sorry for troubling you guys a lot :-(
I just had a look at the text2vec coherence implementation (https://github.com/dselivanov/text2vec/blob/master/R/coherence.R).
This might be a stupid question given my lack of knowledge in this area: which metric from that list is based on UMass? I plan to use both Tommy's metric in textmineR and UMass as an experiment.
Also, what is the difference between 'distance' metrics (cosine, Jaccard, Euclidean, etc.) and 'coherence' metrics? Although coherence is used to measure the quality of topic models, isn't it the same thing (i.e. measuring how similar or how distant topics/terms are)?
Highly appreciate if you guys could share your thoughts on this. Thanks in advance.
Hi Guys,
I found that 'coherence_mean_logratio' is the one that implements UMass, so my stupid question answered itself :-) It would be great if you guys could share your thoughts on my second question above, apart from the different implementation approaches. Thanks!
Coherence metrics are different from the other metrics you mentioned. For example, "mean_npmi_cosim" takes the top words of a topic and calculates the normalized pointwise mutual information (NPMI) between each word pair on the basis of a TCM created from a training corpus. This gives you a vector for each word, and only then is cosine similarity used to calculate the angle each vector encloses with the vector representing the sum of all vectors. Hence, for coherence metrics you usually take the top words of a topic and calculate some statistics on their mutual occurrence in a given corpus (which may be an external or internal corpus, depending on the metric). Depending on the output, you can do further calculations: in the above example you might use the NPMI directly, or do further calculations on it as is done with the cosine similarity. Then you aggregate the scores, and that is your coherence.
The cosine similarity, e.g., "only" calculates the angle between two vectors. A coherence metric does not directly compare topics; it calculates a score for each topic independently of the other topics. Hence, it does not compare word vectors but lists of words. The cosine may or may not play a role in the calculation, depending on whether you turn each top word into some kind of vector representation or not.
Regarding the other metrics (Euclidean, Jaccard, etc.), I would kindly refer you to, e.g., Wikipedia and other common resources; I think these metrics are documented sufficiently there. In general, they are also ways of measuring the distance between vectors (not between lists of words in the first step).
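To make that more concrete, here is a rough base-R sketch of the npmi-plus-cosine idea described above (my simplified paraphrase, not the actual text2vec code; I assume the TCM diagonal holds document frequencies, and smoothing of zero co-occurrences is omitted):

```r
# Rough sketch of a "mean_npmi_cosim"-style score for one topic.
# `tcm`: symmetric boolean-document co-occurrence counts for the topic's
# top words, with document frequencies on the diagonal; `n_docs`: number
# of documents. Simplified; zero co-occurrences would need smoothing.
npmi_cosim_sketch <- function(tcm, n_docs) {
  p_joint <- tcm / n_docs                          # P(a, b)
  p_word  <- diag(tcm) / n_docs                    # P(a)
  npmi <- log(p_joint / outer(p_word, p_word)) / -log(p_joint)
  ref <- colSums(npmi)                             # sum of all word vectors
  cosims <- apply(npmi, 1, function(v)             # cosine of each word vector
    sum(v * ref) / sqrt(sum(v^2) * sum(ref^2)))    #   against the sum vector
  mean(cosims)                                     # aggregate to one score
}
```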
I hope that helps - and I hope Tommy agrees :-)...
Just as a side note, the "mean_difference" metric in text2vec is the same as Tommy's metric - at least it should be ;-) (please let me know if you encounter inconsistencies).
Thanks a lot for the detailed explanation, Manuel. It is indeed helpful. I'll let you know if I run into any issues. Cheers!
Hi Guys, FYI, I am not sure whether you both have seen this, but there is another package called 'SpeedReader' with an implementation of 'topic_coherence': (https://www.rdocumentation.org/packages/SpeedReader/versions/0.9.1/topics/topic_coherence) (https://github.com/matthewjdenny/SpeedReader/blob/master/R/topic_coherence.R)
Thanks, I have seen this. To my knowledge it also uses the log-ratio measure (i.e., UMass), however, not very efficiently programmed. Please correct me if I am wrong.
As far as I understood from the code, it considers word frequencies in the document-term matrix, not the values of beta in the topic model (the logarithmized parameters of the word distribution for each topic).
The dtm is used to generate the intrinsic tcm (boolean document co-occurrence of terms) during calculation of the coherence, given a list of top words (a topic), similar to how it is done in textmineR or text2vec.
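For reference, a minimal sketch of the UMass log-ratio idea on such a boolean co-occurrence TCM (my paraphrase of Mimno et al., not SpeedReader's or any other package's exact code; I assume the TCM diagonal holds document frequencies and `top_terms` is ordered from most to least frequent):

```r
# Minimal sketch of the UMass (log-ratio) coherence of Mimno et al.,
# computed on a boolean document co-occurrence TCM whose diagonal holds
# document frequencies. `top_terms` is ordered from most to least frequent.
umass_sketch <- function(tcm, top_terms) {
  score <- 0
  for (m in 2:length(top_terms)) {
    for (l in 1:(m - 1)) {
      d_both <- tcm[top_terms[m], top_terms[l]]  # docs with both words
      d_l    <- tcm[top_terms[l], top_terms[l]]  # docs with word l
      score  <- score + log((d_both + 1) / d_l)  # +1 smoothing per the paper
    }
  }
  score
}
```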
This conversation is reminding me: I've used probabilistic coherence/mean difference for selecting the number of topics using Monte Carlo simulated data. It is spot on at selecting the correct number of topics. It would be interesting to do the same comparing all of these metrics. I think I'll throw that in as a chapter of the dissertation if one of you guys doesn't beat me to it. :-P
Okay, now I get the approach. I started from stm's 'semanticCoherence', which uses beta. Thanks, I will hopefully get a better understanding by the time I work through textmineR and text2vec.
Ha ha Tommy, please go ahead, do it soon and publish your dissertation, so that I can easily refer to it :P
@TommyJones I am currently conducting a study on about 30k scientific abstracts and trying to use the coherence metrics for finding a suitable number of topics. Please have a look at the following image, which is a first step in this direction. I have normalized the various scores between 0 and 1 to make them plottable in one plot. The plot currently only shows loess-smoothed values for a quick overview; it was rather an initial test.
In the final plot (analysis) I will use the local extrema. Some metrics opt for very few topics, around 50 (NOT visible in the above plot), but have another peak at some higher number of topics (visible in the above plot), which is lower but still significant. Plotting the local extrema will highlight models of potential interest apart from the one with the highest coherence score. I still have to investigate this, but my feeling was that for 30k abstracts, 50 topics seemed too few, whereas the "second", lower peak intuitively seemed more reasonable.
I still have to investigate the topics favored by individual metrics at the "first" and "second" peaks. I will share my experience as soon as I am through with my study. Just thought that this might be of interest to you - not in the sense of a competition, but for mutual learning ;-).
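In case it is useful, a minimal sketch of the normalize-and-smooth step behind such a plot (assuming a hypothetical long-format data.frame `scores` with columns n_topics, metric, and value):

```r
# Sketch of the 0-1 normalization and loess overview plot described above.
# `scores` is a hypothetical data.frame: n_topics, metric, value.
library(ggplot2)
normalize01 <- function(x) (x - min(x)) / (max(x) - min(x))
scores$value_norm <- ave(scores$value, scores$metric, FUN = normalize01)
ggplot(scores, aes(n_topics, value_norm, colour = metric)) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "number of topics", y = "normalized coherence score")
```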
Hey Guys, just a short question, sorry to bother you again =( Are the TCM-creation functions in 'textmineR' and 'text2vec' equivalent to the TermDocumentMatrix function in the 'tm' package?
Also, @TommyJones, can I use the 'Dtm2Tcm' function to convert a document-term matrix created via the 'tm' package (are there any consequences or inaccuracies in doing it that way)?
Thanks in advance.
@sweetmals, the TCM functions are not the same as TermDocumentMatrix in tm. A term-document matrix is just the transpose of a document-term matrix. (In fact, this distinction is kind of pointless in my mind. It's one of many conventions the NLP community has that makes things unnecessarily confusing.) A TCM is a term co-occurrence matrix: a square matrix that counts the number of times word i and word j occur together in some window. For example, if you set "skipgram_window" to 5, it will count the number of times that words i and j occur within 5 words of each other.
The Dtm2Tcm function gives the number of documents in which words i and j co-occur. It's just calculating `t(dtm > 0) %*% (dtm > 0)`.
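A tiny made-up example of what that computes:

```r
# Toy 3-document DTM (made-up counts) and its document co-occurrence TCM.
dtm <- matrix(c(1, 0, 2,
                0, 1, 1,
                3, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("sport", "ball", "fan")))
tcm <- t(dtm > 0) %*% (dtm > 0)
tcm  # entry [i, j] = number of documents containing both word i and word j
#       sport ball fan
# sport     2    1   1
# ball      1    2   1
# fan       1    1   2
```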
@manuelbickel I look forward to seeing where you go with your research. Anecdotally, I had a dataset, the DOJ's Office of Justice Programs' grant database, with ~50k documents, and probabilistic coherence found that I had about 50 topics. When I looked at the abstracts, many of them had very similar wording, as they were grants from the same programs. Meanwhile, probabilistic coherence indicated that 300 - 500 topics would work for a smaller corpus of ~10k abstracts from NIH grants. The range of language is much wider there.
Also, @sweetmals, if you want to convert a tm document term matrix to the type of DTM used by textmineR and text2vec, I have a deprecated (and removed) function for this. It no longer ships with textmineR, but you can find it here
Also, also, tidytext has several "cast" functions that may do the same.
Hi @TommyJones, thank you very much. Yeah, I went through the code [t(dtm > 0) %*% (dtm > 0)] and it took me a while to understand the difference, as I am refreshing my math and stats now.
@manuelbickel & @TommyJones: I look forward to you guys documenting your outcomes (methods and comparisons), either in a joint research paper or individually. This is indeed an area with a gap where one could make a significant research contribution. It would be helpful for scholars (from IS, business, etc.) who are interested in the application side and want to apply the techniques/methods directly without putting in too much effort. Topic modeling is gaining popularity in the IS field now, as it is useful for discourse analysis, in particular for theory building (which I am doing as part of my PhD thesis).
I can provide you with details on the code later; however, could you please delete your question here in the textmineR thread and ask it in text2vec instead, because it is a pure text2vec question. Thanks.
On 19 June 2018 at 03:04, sweetmals notifications@github.com wrote:
Hi @manuelbickel,

It would be great if you could clarify some lines of the coherence.R implementation for me.
    tcm = as.matrix(tcm[top_terms_tcm, top_terms_tcm])

By this point you already have a TCM filtered to the top terms of the input x, plus the original TCM itself. Correct me if I am wrong.

I am not clear on what happens in the following lines within each topic:

    topic_i_term_indices = match(x[, i], terms_tcm)
    # remove NA indices - not all top terms for topic 'i' are necessarily included in tcm
    topic_i_term_indices = topic_i_term_indices[!is.na(topic_i_term_indices)]

Isn't this the same as taking the intersect of top_terms_unique and terms_tcm and then re-constructing the TCM via the line tcm = as.matrix(tcm[top_terms_tcm, top_terms_tcm])?
Please bear with me for my lack of knowledge.
Thanks in advance.
Hi Tommy Jones,
I am approaching a topic modelling project based on scientific abstracts and have a question regarding the coherence measure you have thankfully implemented. Since I am not a computer scientist, I thought I'd ask before spending hours trying to figure it out myself. I guess that you use the "UMass measure" proposed by Mimno et al., is this correct? I did not fully understand lines 72-74 in CalcProbCoherence.R of the pcoh function, i.e.,

    result <- sapply(1:(ncol(count.mat) - 1), function(x) {
      mean(p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
           Matrix::diag(p.mat)[(x + 1):ncol(p.mat)], na.rm = TRUE)
    })

Could you give me a hint how these lines work / what they mean?
I hope I am not bothering you too much with this non-expert question. I would be glad if you could help me improve my understanding.

Thanks in advance,
Manuel Bickel