adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License

Topic Coherence Computation: Division by 45? #24

Closed: mona-timmermann closed 1 year ago

mona-timmermann commented 3 years ago

Why do they divide by 45 when computing topic coherence based on normalised PMI? The paper says to, but the computation in the code looks different to me.

[Screenshot: Screen Shot 2020-12-10 at 16 38 23]

jfcann commented 3 years ago

Hi mona-timmermann, the reason for the 45 is that there are 45 ways of picking 2 distinct words from a list of 10 words. Equivalently, there are 45 (i, j) summation indices used in the TC equation above. You divide by 45 so that you have the average PMI.
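
The pair count can be checked with a few lines of Python. This is only a sketch of the counting argument, not the repository's code: the word labels, the `average_over_pairs` helper, and the constant score function are hypothetical placeholders standing in for the real top words and the normalised PMI.

```python
from itertools import combinations
from math import comb

# Hypothetical stand-ins for the top 10 words of one topic.
top_words = [f"w{i}" for i in range(1, 11)]

# Every unordered pair of distinct words, i.e. every (i, j) with i < j.
pairs = list(combinations(top_words, 2))
n_pairs = len(pairs)
print(n_pairs, comb(10, 2))  # 45 45


def average_over_pairs(words, score):
    """Average a pairwise score over all unordered pairs of distinct words.

    `score` is a placeholder for a pairwise measure such as normalised PMI.
    """
    total = sum(score(a, b) for a, b in combinations(words, 2))
    return total / comb(len(words), 2)


# With a constant score of 1.0 the sum is 45, so the average is 1.0.
avg = average_over_pairs(top_words, lambda a, b: 1.0)
print(avg)  # 1.0
```

Dividing by the number of pairs is what turns the summed pairwise scores into a per-pair average, so the divisor has to match the number of words actually used.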

yuyangstatistics commented 2 years ago

If we run the 'eval' mode, the log file shows counter = 55. I think this is due to a small error in the get_topic_coherence() function: top_10 = list(beta[k].argsort()[-11:][::-1]) selects 11 words rather than 10, and 11 words give 55 pairs. It should instead be top_10 = list(beta[k].argsort()[-10:][::-1]). After this change, the counter equals 45.
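
The effect of the slice can be reproduced without the repository. The sketch below is an assumption-laden illustration: `argsort` is a pure-Python stand-in for NumPy's `argsort`, `count_pairs` is a hypothetical helper that mimics only the slicing and pair counting, and `row` is random data rather than a real row of `beta`.

```python
from itertools import combinations
import random


def argsort(values):
    """Indices that sort `values` in ascending order (stand-in for NumPy's argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def count_pairs(row, n_from_end):
    """Apply the slice [-n_from_end:][::-1] to the ascending argsort
    (largest weights first) and count the (i, j) pairs with i < j."""
    top = argsort(row)[-n_from_end:][::-1]
    return len(top), len(list(combinations(top, 2)))


random.seed(0)
row = [random.random() for _ in range(50)]  # hypothetical topic-word weights

words_11, pairs_11 = count_pairs(row, 11)  # slice as quoted from the code
words_10, pairs_10 = count_pairs(row, 10)  # slice as proposed in the fix
print(words_11, pairs_11)  # 11 55
print(words_10, pairs_10)  # 10 45
```

The slice `[-11:]` keeps eleven indices, so the pair counter reaches 55; `[-10:]` keeps ten and yields the 45 pairs that the divisor assumes.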