bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Different log-likelihoods per doc on inference #147

Closed: EvanDufraisse closed this issue 2 years ago

EvanDufraisse commented 2 years ago

Hello,

First, I'd like to thank you for sharing this wonderful library!

To monitor my experiments, I'd like to use a validation set to check the convergence of an LDA model.

I don't understand why, after inference over a validation dataset, the per-document log-likelihoods differ depending on whether I read them with doc.get_ll() or from the array of log-likelihoods returned by mdl.infer().

Illustration:

    results, ll_per_doc = mdl.infer(validation_corpus)
    ll_per_doc_from_results = [doc.get_ll() for doc in results]

The resulting ll_per_doc_from_results does not differ from ll_per_doc by a constant factor; the ratio ll_per_doc / ll_per_doc_from_results is around 4 on average.

Thanks in advance for clarifying the difference between the two approaches; I'm not sure I understand the source code for that part.

bab2min commented 2 years ago

Hi @EvanDufraisse, there is actually a subtle difference between the two. Sorry for the confusion caused by the poor documentation.

Generally, topic models such as LDA have two likelihood terms: one for the doc-topic distribution and the other for the topic-term distribution. The total log-likelihood is therefore the sum of the doc-topic log-likelihood (aka ll for docs) and the topic-term log-likelihood (aka ll for topics).

doc.get_ll() returns only the doc-topic log-likelihood for the given doc. mdl.infer(), however, returns the sum of the doc-topic log-likelihood and the difference between the topic-term log-likelihood before inference and the one after inference.

I agree that this is confusing. I will improve the documentation and add an optional argument that selects which type of log-likelihood is returned (only doc-topic, only topic-term, or both).

You can see the details in the internal code of mdl.infer() below:

https://github.com/bab2min/tomotopy/blob/20a44b03dc67b90730edfb9e21623c9bed9f17ee/src/TopicModel/LDAModel.hpp#L914

https://github.com/bab2min/tomotopy/blob/20a44b03dc67b90730edfb9e21623c9bed9f17ee/src/TopicModel/LDAModel.hpp#L934-L936
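The distinction between the two quantities can be sketched in plain Python. This is not tomotopy's implementation. It assumes the standard collapsed (Dirichlet-multinomial) form of the LDA likelihood, a symmetric topic-term prior eta, and that the topic-term difference is taken as "after inference minus before inference". The function names doc_topic_ll, topic_term_ll and infer_style_ll, as well as the toy counts, are hypothetical and exist only for illustration.

```python
from math import lgamma


def doc_topic_ll(topic_counts, alpha):
    """Doc-topic term: log p(z_d | alpha) for one document.

    topic_counts[k] is the number of tokens in the document assigned to topic k.
    """
    alpha_sum = sum(alpha)
    ll = lgamma(alpha_sum) - lgamma(alpha_sum + sum(topic_counts))
    for n_dk, a_k in zip(topic_counts, alpha):
        ll += lgamma(a_k + n_dk) - lgamma(a_k)
    return ll


def topic_term_ll(topic_word_counts, eta):
    """Topic-term term: log p(w | z, eta), summed over all topics.

    topic_word_counts[k][v] is the number of times word v is assigned to topic k.
    """
    ll = 0.0
    for counts in topic_word_counts:
        vocab_size = len(counts)
        ll += lgamma(vocab_size * eta) - lgamma(vocab_size * eta + sum(counts))
        for n_kv in counts:
            ll += lgamma(eta + n_kv) - lgamma(eta)
    return ll


def infer_style_ll(assignments, topic_word_counts, alpha, eta):
    """Return (doc-topic ll, doc-topic ll + change in topic-term ll).

    assignments is a list of (word_id, topic_id) pairs for a held-out document.
    The sign convention (after minus before) is an assumption of this sketch.
    """
    topic_counts = [0] * len(alpha)
    after = [row[:] for row in topic_word_counts]  # copy, keep the trained counts intact
    for word_id, topic_id in assignments:
        topic_counts[topic_id] += 1
        after[topic_id][word_id] += 1
    ll_doc = doc_topic_ll(topic_counts, alpha)
    ll_delta = topic_term_ll(after, eta) - topic_term_ll(topic_word_counts, eta)
    return ll_doc, ll_doc + ll_delta


# Toy model: 2 topics, 3 vocabulary words, counts from a hypothetical training run.
trained_counts = [[5, 1, 0], [0, 2, 6]]
alpha = [0.1, 0.1]
eta = 0.01
# Held-out document: word 0 twice in topic 0, word 2 once in topic 1.
held_out = [(0, 0), (0, 0), (2, 1)]

ll_doc, ll_total = infer_style_ll(held_out, trained_counts, alpha, eta)
print("doc-topic only:", ll_doc)
print("doc-topic + topic-term change:", ll_total)
```

In this sketch the second value is never larger than the first, because the topic-term change is the log-probability of the document's words given their topic assignments. The two values are also not related by a fixed factor, since the extra term depends on which words the document contains.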

EvanDufraisse commented 2 years ago

Thanks for your quick reply, @bab2min! It now makes perfect sense to me. I think such an option would indeed be very useful, and it would at the same time make users aware that these two types of log-likelihood terms exist. Thanks again for your great work.