mrufsvold opened 2 years ago
I got some feedback from a colleague on my code and refactored it in light of his suggestions. Basically, I changed the `make_word_pairs()` function from building one large array to returning a channel that yields one tuple at a time, which significantly reduced allocations.
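The idea, sketched in Python rather than the PR's Julia (a generator plays the role of the `Channel`, and the exact signature of `make_word_pairs` here is an assumption):

```python
def make_word_pairs(topic_terms):
    """Yield one (previous, current) pair at a time instead of
    materializing the full pair array up front."""
    for i, current in enumerate(topic_terms):
        # Terms earlier in the list are assumed to have higher salience.
        for previous in topic_terms[:i]:
            yield (previous, current)

# Pairs are produced lazily, so the consumer allocates only what it uses.
pairs = list(make_word_pairs(["apple", "banana", "cherry"]))
```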
I also added a parallel version of `findcoherence()`, but it requires Julia 1.7 because it uses `@atomic`. I have a question posted on the Discourse page to see if anyone has a more backwards-compatible approach, and I'll update it if I get a suggestion.
I also found one logical mistake I made and addressed that. I need to rerun the scores for the Readme, but I won't have time for that until next week.
Okay, I spent some time getting a better understanding of channels, and I created a thread-safe version of the accumulation without using `@atomic`, so it should be backwards compatible.
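The accumulation pattern is roughly the following (an illustrative Python sketch, with a `Queue` standing in for a Julia `Channel`; `parallel_sum` and the chunk layout are assumptions, not the PR's actual code):

```python
import threading
from queue import Queue

def parallel_sum(chunks):
    """Workers push partial results onto a queue; only the main thread
    touches the running total, so no atomics are needed."""
    results = Queue()
    threads = [threading.Thread(target=lambda c=c: results.put(sum(c)))
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Drain the queue serially -- this is the thread-safe accumulation.
    total = 0.0
    while not results.empty():
        total += results.get()
    return total
```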
Thanks @mrufsvold. I've been a bit busier than usual, so I haven't had time to review your PR; however, I hope to get to it as soon as I can.
Don't mean to rush you! I hope life will slow down for you. Please don't worry about this PR if you need some rest too!
I made some minor changes, mostly syntactic; however, `buffer` is not defined in the `findcoherence` function, so this needs to be fixed before we can proceed.
Sorry for the delay. Thanks for the catch!
Sorry for messing up your formatting! I thought I had done that, and was trying to make it more standard-ish.
Going forward, I'm trying to enforce a comment syntax of `## comment...`, so I reverted this part of your commit back to what I had before.
Also, there are some edge-case bugs which occur when the corpus vocabulary is very small, or when it contains terms which do not appear in any of the documents; these will also need to be addressed prior to merging the pull request. The cause appears to be division by zero in the `calculate_confirmation` function and at the bottom of the `findcoherence` function, which results in `findcoherence` returning NaNs.
For example, `model = LDA(Corpus(), 10)` is a model with an empty corpus, and produces a NaN when one runs `findcoherence(model)`. Similarly, both models

```julia
corp = readcorp(:citeu); corp.docs = corp[1:3]; fixcorp!(corp); model = LDA(corp, 10)
# findcoherence(model) = NaN

model = LDA(Corpus(Document(1:2), vocab=["term1", "term2"]), 10)
# findcoherence(model) = NaN
```

produce NaN values. In the empty case, I think the expected output of `findcoherence` should probably be 0; however, in the latter cases (and other cases similar to them) I'm not sure what the appropriate behavior is.
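One way to guard the division (a Python sketch; treating the degenerate case as 0.0 follows the "empty corpus should give 0" suggestion and is otherwise an assumption):

```python
def confirmation(p_prev_given_cur, p_prev_given_not_cur):
    """Confirmation measure with a zero-denominator guard, so the
    edge cases return 0.0 instead of NaN."""
    denom = p_prev_given_cur + p_prev_given_not_cur
    if denom == 0.0:
        return 0.0  # assumption: define the degenerate case as zero
    return (p_prev_given_cur - p_prev_given_not_cur) / denom

confirmation(0.0, 0.0)  # 0.0 rather than NaN
```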
Add a coherence measure function to `modelutils.jl` that implements a modified UMass coherence score based on the findings of "Exploring the Space of Topic Coherence Measures" (Röder, Both, & Hinneburg, 2015). At a high level, the function takes the following steps:
1) Get keys for topic words.
2) Make one2previous token pairs for each topic (for each token, create tuples with all other tokens in the topic with a higher rated salience).
3) For each pair, calculate the confirmation measure: (P(prev|current) - P(prev|!current)) / (P(prev|current) + P(prev|!current)).
4) Take the mean of these confirmation measures.
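The steps above can be sketched end-to-end like this (Python rather than the repo's Julia; every name here is illustrative, and document membership stands in for the model's actual probability estimates):

```python
from statistics import mean

def coherence(topic_terms, docs):
    """Mean confirmation over one2previous pairs, with probabilities
    estimated from document co-occurrence counts."""
    n = len(docs)
    if n == 0:
        return 0.0
    def p(pred):
        return sum(1 for d in docs if pred(d)) / n
    scores = []
    for i, cur in enumerate(topic_terms):
        for prev in topic_terms[:i]:  # earlier terms = higher salience
            p_cur = p(lambda d: cur in d)
            p_both = p(lambda d: cur in d and prev in d)
            p_prev_no_cur = p(lambda d: cur not in d and prev in d)
            # P(prev|current) and P(prev|!current), guarding empty conditions
            a = p_both / p_cur if p_cur else 0.0
            b = p_prev_no_cur / (1.0 - p_cur) if p_cur < 1.0 else 0.0
            scores.append((a - b) / (a + b) if a + b else 0.0)
    return mean(scores) if scores else 0.0
```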
Röder, Both, & Hinneburg find even better performance on corpora with large documents using indirect confirmation measures (comparing the relationship between each word in a pair and all other words in the vocabulary). However, since my application uses small documents, the added complexity wasn't worth whatever small gains I might get; I would consider taking this on later.
They also suggest using a sliding window (considering words 1 through 100 in a document, then moving the window along one word at a time) to create virtual documents. However, since this repo is already set up to use bag-of-words, I didn't want to introduce the additional complexity of handling word order accurately.
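For reference, the sliding-window scheme they describe looks roughly like this (an illustrative Python sketch; `sliding_windows` is a hypothetical name):

```python
def sliding_windows(tokens, size=100):
    """Yield virtual documents: each window of `size` consecutive
    tokens, advancing one token at a time."""
    if len(tokens) <= size:
        yield tuple(tokens)  # short docs become a single window
        return
    for i in range(len(tokens) - size + 1):
        yield tuple(tokens[i:i + size])
```

Each window is then treated as its own document when counting co-occurrences, which is exactly the word-order bookkeeping that a bag-of-words pipeline avoids.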
For what it's worth -- I ran the models you created in the tutorial through the coherence measure using the training corpus (I didn't expand it to a more general set of documents) and got coherences of 0.330 and 0.315 for the LDA and fCTM respectively. I find it surprising that fCTM didn't come out with higher coherence than LDA, but using the training corpus rather than a more general reference corpus could be the source of the problem.