Closed EvanDufraisse closed 2 years ago
Hi @EvanDufraisse Well, there is actually a subtle difference between the two. Sorry for the confusion due to the poor documentation of this.
Generally, topic models such as LDA have two likelihood terms: one for the doc-topic distribution and one for the topic-term distribution. The total log-likelihood is therefore the sum of the log-likelihood for doc-topic (aka ll for docs) and the log-likelihood for topic-term (aka ll for topics).
doc.get_ll() returns only the doc-topic log-likelihood for the given doc.
But mdl.infer() returns the sum of the doc-topic log-likelihood and the difference between the topic-term log-likelihood before inference and the one after inference.
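As a rough sketch of the relationship described above: this is plain Python arithmetic, not tomotopy's implementation. The function name infer_ll and all numbers are made up for illustration, and the difference is assumed to be taken as "after minus before".

```python
def infer_ll(doc_topic_ll, topic_term_ll_before, topic_term_ll_after):
    """Illustrative only: doc-topic ll plus the change in topic-term ll
    (assumed here to be after minus before) caused by the inferred doc."""
    return doc_topic_ll + (topic_term_ll_after - topic_term_ll_before)


# Hypothetical values for two documents (not real model output).
docs = [
    {"doc_topic_ll": -50.0, "before": -1000.0, "after": -1150.0},
    {"doc_topic_ll": -80.0, "before": -1000.0, "after": -1200.0},
]

for d in docs:
    total = infer_ll(d["doc_topic_ll"], d["before"], d["after"])
    # The ratio depends on each document's topic-term change,
    # so it does not have to be the same for every document.
    ratio = total / d["doc_topic_ll"]
    print(total, ratio)
```

With these made-up numbers the two ratios differ, which is consistent with the two values not being related by a single constant factor.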
I think this is confusing as well. I will improve the documentation for this and add an optional argument that selects which type of log-likelihood is returned (only doc-topic, only topic-term, or both).
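To make the planned option concrete, here is a minimal sketch of the selection logic only. The helper select_ll, its kind parameter, and the string values are hypothetical; they are not part of tomotopy's API, and the final argument may look different.

```python
def select_ll(doc_topic_ll, topic_term_ll_delta, kind="both"):
    """Hypothetical helper: choose which log-likelihood term(s) to report.

    doc_topic_ll        -- doc-topic log-likelihood of one document
    topic_term_ll_delta -- change in topic-term log-likelihood due to inference
    kind                -- "doc-topic", "topic-term", or "both" (made-up names)
    """
    if kind == "doc-topic":
        return doc_topic_ll
    if kind == "topic-term":
        return topic_term_ll_delta
    if kind == "both":
        return doc_topic_ll + topic_term_ll_delta
    raise ValueError(f"unknown kind: {kind!r}")
```

With kind="doc-topic" the result would correspond to what doc.get_ll() reports, and with kind="both" to the value described above for mdl.infer().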
You can see the details in the internal code of mdl.infer() below:
https://github.com/bab2min/tomotopy/blob/20a44b03dc67b90730edfb9e21623c9bed9f17ee/src/TopicModel/LDAModel.hpp#L914
https://github.com/bab2min/tomotopy/blob/20a44b03dc67b90730edfb9e21623c9bed9f17ee/src/TopicModel/LDAModel.hpp#L934-L936
Thanks for your quick reply @bab2min! It now makes perfect sense to me. I suppose such an option could indeed be of great use, and could at the same time inform users of the existence of those two types of ll terms. Thanks again for your great work.
Hello,
First, I'd like to thank you for sharing this wonderful library!
To monitor my experiments, I'd like to use a validation set to control the convergence of an LDA model.
I don't understand why, after inference over a validation dataset, the returned per-document log-likelihoods differ depending on whether I read them with doc.get_ll() or from the array of log-likelihoods returned by mdl.infer().
Illustration:
results, ll_per_doc = mdl.infer(validation_corpus)
ll_per_doc_from_results = [doc.get_ll() for doc in results]
I get an ll_per_doc_from_results that differs from ll_per_doc, and not by a constant factor: on average, ll_per_doc / ll_per_doc_from_results is around 4.
Thanks for clarifying the difference between the two approaches. I'm not sure I understand the source code for that part.