dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Output hyper-parameters for LDA model #290

Closed leungi closed 5 years ago

leungi commented 5 years ago

This is a question on the LDA implementation.

In the textmineR package, the LDA model outputs theta = P(topic|document), phi = P(token|topic), and gamma = P(topic|token).

In text2vec, theta can be obtained from LDA$fit_transform().

Question: I presume LDA$topic_word_distribution gives gamma; is it possible to extract phi, i.e. P(token|topic)?

Thanks!

TommyJones commented 5 years ago

I would actually presume that LDA$topic_word_distribution gives phi. That's calculated along the way for most implementations, including text2vec's WarpLDA. It takes post-processing (basically Bayes' Rule) to get gamma.
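The Bayes'-rule post-processing mentioned above can be sketched in base R on a toy phi matrix (the matrix values and topic marginals here are made up for illustration, not taken from any fitted model):

```r
# Toy phi: 2 topics x 3 tokens, each row sums to 1, i.e. P(token | topic)
phi <- rbind(c(0.5, 0.3, 0.2),
             c(0.1, 0.1, 0.8))
p_topic <- c(0.6, 0.4)  # assumed marginal topic probabilities, e.g. averaged from theta

# Bayes' rule: P(topic | token) is proportional to P(token | topic) * P(topic).
# p_topic recycles down columns, so joint[k, w] = phi[k, w] * p_topic[k]
joint <- phi * p_topic
gamma <- t(t(joint) / colSums(joint))  # normalize each column to sum to 1
```

Each column of `gamma` is then a distribution over topics for one token.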

dselivanov commented 5 years ago

LDA$components gives you unnormalized topic-word counts. You can calculate P(token|topic) or P(topic|token) just by normalizing this matrix by row or column (making each row or column have unit L1 norm, i.e. dividing each element by its row or column sum).
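In base R, the two normalizations look like this (a toy counts matrix stands in for LDA$components from a fitted model):

```r
# Toy unnormalized topic-word counts: rows = topics, columns = tokens,
# standing in for LDA$components from a fitted text2vec model
counts <- rbind(c(10, 5, 1),
                c(2, 8, 20))

# rowSums recycles down columns, so each element is divided by its row sum
phi   <- counts / rowSums(counts)        # P(token | topic): each row sums to 1
gamma <- t(t(counts) / colSums(counts))  # P(topic | token): each column sums to 1
```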

leungi commented 5 years ago

Thank you both for the prompt assistance!

@TommyJones: you're right :+1:

@dselivanov: hope I understand you right...
P(token|topic) = LDA$topic_word_distribution = normalize(LDA$components, 'l1')
P(topic|token) = normalize(t(LDA$components), 'l1')

When I compared the LDA output (using 1st 500 rows from movie_review data) of text2vec and topicmodels packages, I noticed significant difference (with seed fixed).

For the word wild, topicmodels assigned relatively significant weight to 6 topics, while text2vec assigned it to only 2 topics. I presume this is due to differing implementations of LDA?

dselivanov commented 5 years ago

P(token|topic) = LDA$topic_word_distribution = normalize(LDA$components, 'l1')
P(topic|token) = normalize(t(LDA$components), 'l1')

Yes.

When I compared the LDA output (using 1st 500 rows from movie_review data) of text2vec and topicmodels packages, I noticed significant difference (with seed fixed).

Not sure about the difference. It may be underfitting/overfitting or different hyper-parameters (priors). Try checking the perplexity of both models.
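A perplexity check on the text2vec side could be sketched as follows. This assumes text2vec's LDA class and perplexity() function as documented; the priors and iteration counts here are illustrative, not tuned, so treat it as a starting point rather than a verified comparison script:

```r
library(text2vec)

# Fit an LDA model on the first 500 movie_review docs, as in the comparison above
data("movie_review")
tokens <- word_tokenizer(tolower(movie_review$review[1:500]))
it <- itoken(tokens, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

lda <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic <- lda$fit_transform(dtm, n_iter = 1000, convergence_tol = 1e-3,
                               progressbar = FALSE)

# Lower perplexity = better fit; compute the analogous number for the
# topicmodels fit on the same data to make the comparison meaningful
perplexity(dtm,
           topic_word_distribution = lda$topic_word_distribution,
           doc_topic_distribution = doc_topic)
```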