ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

Features/add coherence measure #41

Open mrufsvold opened 2 years ago

mrufsvold commented 2 years ago

Add a coherence measure function to modelutils.jl that implements a modified UMass coherence score based on the findings of "Exploring the Space of Topic Coherence Measures" (Röder, Both, & Hinneburg, 2015).

At a high level, the function takes the following steps:

1. Get keys for the topic words.
2. Make one2previous token pairs for each topic (for each token, create tuples with all other tokens in the topic with a higher rated salience).
3. For each pair, calculate the confirmation measure: `(P(prev|current) - P(prev|!current)) / (P(prev|current) + P(prev|!current))`.
4. Take the mean of these confirmation measures.
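A minimal sketch of the steps above, assuming document-level co-occurrence counts. All helper names (`doc_freq`, `joint_freq`, `confirmation`, `topic_coherence`) are illustrative, not the names used in modelutils.jl, and degenerate cases (empty corpus, zero denominators) are not handled here:

```julia
## Fraction of documents containing term w.
doc_freq(docs, w) = count(d -> w in d, docs) / length(docs)

## Fraction of documents containing both terms a and b.
joint_freq(docs, a, b) = count(d -> a in d && b in d, docs) / length(docs)

## Confirmation measure for a (prev, current) pair:
## (P(prev|current) - P(prev|!current)) / (P(prev|current) + P(prev|!current))
function confirmation(docs, prev, current)
    p_cur = doc_freq(docs, current)
    p_joint = joint_freq(docs, prev, current)
    p_prev_given_cur = p_joint / p_cur
    p_prev_given_notcur = (doc_freq(docs, prev) - p_joint) / (1 - p_cur)
    (p_prev_given_cur - p_prev_given_notcur) / (p_prev_given_cur + p_prev_given_notcur)
end

## Topic coherence: mean confirmation over one2previous pairs, where
## topic_words is assumed sorted by descending salience, so each token is
## paired with every higher-salience token that precedes it.
function topic_coherence(docs, topic_words)
    pairs = [(topic_words[i], topic_words[j]) for j in 2:length(topic_words) for i in 1:j-1]
    sum(confirmation(docs, prev, cur) for (prev, cur) in pairs) / length(pairs)
end
```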

Röder, Both, & Hinneburg find even better performance on corpora with large documents using indirect confirmation measures (comparing the relationship between each word in a pair and all other words in the vocabulary). However, my application has small documents, so it wasn't worth the small gains I might get. I would consider taking this on later.

They also suggest using a sliding window (considering words 1 through 100 in the document, then moving the window along the document one word at a time) to form virtual documents. However, since this repo is already set up to use bag-of-words, I didn't want to introduce the additional complexity of handling word order accurately.
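For reference, the sliding-window idea could be sketched as below (a hypothetical helper, not part of this PR): each window of `width` consecutive tokens is treated as its own virtual document for co-occurrence counting.

```julia
## Split a token sequence into overlapping windows ("virtual documents").
## Not used in this PR, since the package works with bags of words.
function sliding_windows(tokens::Vector, width::Int=100)
    length(tokens) <= width && return [tokens]
    [tokens[i:i+width-1] for i in 1:length(tokens)-width+1]
end
```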

For what it's worth -- I ran the models you created in the tutorial through the coherence measure using the training corpus (I didn't expand the corpus to a more general set of documents) and got coherences of 0.330 and 0.315 for the LDA and fCTM respectively. I find it surprising that fCTM didn't come out with higher coherence than LDA, but using the training corpus rather than a more general reference corpus could be the source of the problem.

mrufsvold commented 2 years ago

I got some feedback from a colleague on my code and refactored it in light of his suggestions. Basically, I changed the `make_word_pairs()` function from creating a large array to returning a channel that yields one tuple at a time, which significantly reduced allocations.
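The refactor described above can be sketched like this (illustrative, not the PR's exact code): `make_word_pairs` returns a `Channel` that yields one pair at a time instead of materializing all pairs up front.

```julia
## Yield one2previous pairs lazily: each token is paired with every
## higher-salience token that precedes it in the topic's ranked word list.
function make_word_pairs(topic_words::Vector)
    Channel() do ch
        for j in 2:length(topic_words), i in 1:j-1
            put!(ch, (topic_words[i], topic_words[j]))
        end
    end
end
```

Consumers simply iterate: `for (prev, cur) in make_word_pairs(words) ... end`.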

I also added a parallel version of `findcoherence()`, but it depends on Julia 1.7 because it uses `@atomic`. I have a question posted on Discourse to see if anyone has a more backwards-compatible approach. I'll update it if I get a suggestion.

I also found one logical mistake I made and addressed it. I need to rerun the scores for the README, but I won't have time for that until next week.

mrufsvold commented 2 years ago

Okay, I spent some time getting a better understanding of channels, and I created a thread-safe version of the accumulation without using `@atomic`, so it should be backwards compatible.
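One way to get a thread-safe accumulation without `@atomic` is sketched below (a hypothetical example, not necessarily the PR's implementation): each worker task sends its partial sum over a buffered `Channel`, whose `put!`/`take!` operations are thread-safe, and the reduction happens on the receiving side.

```julia
## Mean of a vector computed in parallel without @atomic: workers put partial
## sums on a Channel instead of mutating a shared accumulator.
function parallel_mean(scores::Vector{Float64})
    isempty(scores) && return 0.0
    chunksize = max(1, cld(length(scores), Threads.nthreads()))
    chunks = collect(Iterators.partition(eachindex(scores), chunksize))
    results = Channel{Float64}(length(chunks))
    @sync for chunk in chunks
        Threads.@spawn put!(results, sum(@view scores[chunk]))
    end
    sum(take!(results) for _ in 1:length(chunks)) / length(scores)
end
```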

ericproffitt commented 2 years ago

Thanks @mrufsvold. I've been a bit busier than usual, so I haven't had time to review your PR; however, I hope to get to it at my earliest convenience.

mrufsvold commented 2 years ago

Don't mean to rush you! I hope life will slow down for you. Please don't worry about this PR if you need some rest too!

ericproffitt commented 2 years ago

I made some minor changes, mostly syntactic. However, `buffer` is not defined in the `findcoherence` function, so this needs to be fixed before we can proceed.

mrufsvold commented 2 years ago

Sorry for the delay. Thanks for the catch!

mrufsvold commented 2 years ago

Sorry for messing up your formatting! I thought I had done that, and was trying to make it more standard-ish.

ericproffitt commented 2 years ago

Going forward, I'm trying to enforce a comment syntax of `## comment...`, so I reverted this part of your commit back to what I had before.

Also, there are some edge-case bugs which occur when the corpus vocabulary is very small, or when it contains terms which do not exist in any of the documents. These will also need to be addressed prior to merging the pull request. The cause appears to be division by zero in the `calculate_confirmation` function and at the bottom of the `findcoherence` function, which results in `NaN` output from `findcoherence`.

For example, `model = LDA(Corpus(), 10)` is a model with an empty corpus, and produces a `NaN` when one runs `findcoherence(model)`. Similarly, both models

```julia
corp = readcorp(:citeu); corp.docs = corp[1:3]; fixcorp!(corp); model = LDA(corp, 10)
# findcoherence(model) = NaN

model = LDA(Corpus(Document(1:2), vocab=["term1", "term2"]), 10)
# findcoherence(model) = NaN
```

produce `NaN` values. In the empty case, I think the expected output of `findcoherence` should probably be `0`; however, in the latter cases (and other cases similar to them) I'm not sure what the appropriate behavior is.
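If it helps, one way to stop the `NaN` from propagating is a guarded division that falls back to `0.0` on a zero denominator (a sketch with a hypothetical helper name; whether 0 is the right convention for the non-empty edge cases is the open question above):

```julia
## Return num/den, but fall back to 0.0 when the denominator is zero, so
## empty or degenerate corpora yield a coherence of 0 instead of NaN.
safedivide(num, den) = iszero(den) ? 0.0 : num / den
```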