bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

CTM: Topic Count Impossibly Large #202

Open tau-241 opened 1 year ago

tau-241 commented 1 year ago

I trained a correlated topic model (CTM) on a 4,500-document corpus to learn which topics occur and how often. The results were very good, but one of the topics (#14) has an impossible count, more than double the number of documents: [image]

This library is easy to use and very fast, and I feel lucky to have found it, but I can't use the results when a known-to-be-common topic has an impossible count.

I tried HDPModel and got a similar result: one topic (#6) had a count of almost 4x the number of documents: [image]

What caused the large counts? Did I make a mistake? Is there a way for me to get the topic distributions for each individual document?

Thank you!
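
To make the last question concrete, here is a minimal sketch of the per-document counting I am after. It assumes the large numbers came from `get_count_by_topics()`, which, as far as I can tell from the docs, reports word tokens per topic rather than documents; if so, counts above the corpus size would be expected. The helper names `dominant_topic_counts` and `docs_above_threshold` are hypothetical, and the toy distributions stand in for what `doc.get_topic_dist()` would return for each document in `mdl.docs`.

```python
from typing import Iterable, List, Sequence

# With a trained tomotopy model `mdl`, the input could be built like this
# (not run here, so the sketch stays self-contained):
#   dists = [doc.get_topic_dist() for doc in mdl.docs]


def dominant_topic_counts(dists: Iterable[Sequence[float]], k: int) -> List[int]:
    """Count each document once, under its single most probable topic."""
    counts = [0] * k
    for dist in dists:
        best = max(range(k), key=lambda t: dist[t])
        counts[best] += 1
    return counts


def docs_above_threshold(
    dists: Iterable[Sequence[float]], k: int, threshold: float = 0.2
) -> List[int]:
    """Count, per topic, the documents whose weight for that topic
    reaches `threshold`. A document may count toward several topics."""
    counts = [0] * k
    for dist in dists:
        for t in range(k):
            if dist[t] >= threshold:
                counts[t] += 1
    return counts


# Toy per-document topic distributions (3 documents, 3 topics).
toy_dists = [
    [0.65, 0.25, 0.10],
    [0.10, 0.80, 0.10],
    [0.50, 0.40, 0.10],
]

by_dominant = dominant_topic_counts(toy_dists, k=3)
by_threshold = docs_above_threshold(toy_dists, k=3, threshold=0.2)
print(by_dominant)
print(by_threshold)
```

Either way of counting is bounded by the number of documents for every topic, which is the property I need. The 0.2 threshold is an arbitrary choice for illustration.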