bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Coherence crashing for 50 topics LDA model / 40k+ long documents (~20M total tokens) #191

Open Dijkie85 opened 1 year ago

Dijkie85 commented 1 year ago

Computing c_v coherence for a 50-topic LDA model trained on 40k long documents (around 20M tokens in total) runs for about 15 minutes and then crashes the kernel. Computing the same score with gensim (via the great snippet provided in another issue) works just fine and takes about 2.5 minutes.

I'm running the following code, adapted from the examples repo, on tomotopy 0.12.3 / Python 3.10.8:

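from tomotopy.coherence import Coherence

# lda_model_50k is the trained 50-topic tomotopy LDA model
coh_model = Coherence(lda_model_50k, coherence='c_v')
average_coherence = coh_model.get_score()  # average c_v over all topics
print(average_coherence)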
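
For comparison, a gensim-based c_v computation on the same model could look roughly like the sketch below. This is not the snippet from the other issue. The helper names `extract_topics_and_texts` and `gensim_cv_coherence` are hypothetical, and the code assumes a tomotopy-style model exposing `k`, `get_topic_words`, `docs` and `vocabs`.

```python
def extract_topics_and_texts(model, top_n=10):
    """Pull top words per topic and tokenized documents out of a trained model.

    Assumption: `model` behaves like a tomotopy LDA model, i.e. it exposes
    `k`, `get_topic_words(topic_id, top_n=...)`, `docs` (each document has
    integer word ids in `.words`) and `vocabs` (id -> word lookup).
    """
    topics = [
        [word for word, _ in model.get_topic_words(k, top_n=top_n)]
        for k in range(model.k)
    ]
    texts = [[model.vocabs[w] for w in doc.words] for doc in model.docs]
    return topics, texts


def gensim_cv_coherence(model, top_n=10):
    """Average c_v coherence computed by gensim instead of tomotopy."""
    # Imported lazily so the extraction helper works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    topics, texts = extract_topics_and_texts(model, top_n)
    dictionary = Dictionary(texts)
    cm = CoherenceModel(
        topics=topics,
        texts=texts,
        dictionary=dictionary,
        coherence="c_v",
        topn=top_n,
    )
    return cm.get_coherence()
```

Calling `gensim_cv_coherence(lda_model_50k)` would then return gensim's average c_v score for the same top words per topic.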

Any thoughts?