Sixy1204 opened this issue 2 years ago
Hello @Sixy1204, what do you mean by word embedding vectors in topic space? After training, an LDA model generates document-topic distributions and topic-word distributions. If you want to get the topic distribution for each word, you can obtain it as follows:
import numpy as np
import tomotopy as tp

mdl = tp.LDAModel(...)
# add_doc and train...
topic_word_dists = np.array([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
# now `topic_word_dists` has shape [k, v] where k is # of topics and v is # of vocabs.
words = mdl.used_vocabs
for i, word in enumerate(words):
    print(word, topic_word_dists[:, i])
    # print the topic weights of each word.
# n.b. The sum of the topic weights for each word does not equal 1.
@bab2min I'm sorry, I didn't make myself clear. I read a paper that describes how to compute a vector representation of a topic word (as shown below). Could you please tell me where I can find the number of times that word w is assigned to topic z during training? Thanks a lot.
I ran the code above, but the word probabilities look really small.
@Sixy1204 You can normalize the values so that they sum to 1 as follows:
topic_word_dists = np.array([mdl.get_topic_word_dist(k, normalize=False) for k in range(mdl.k)]) # turn off normalizing over word axis
topic_word_dists = topic_word_dists / topic_word_dists.sum(axis=0, keepdims=True) # normalize over topic axis
But it may be slightly different from the value suggested by the paper, because LDAModel.get_topic_word_dist yields smoothed values from the Dirichlet distributions. If you want the exact values, you should count the topic-word matrix manually.
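In case it helps, here is a rough sketch of what counting the topic-word matrix manually could look like. The attribute names doc.words and doc.topics (per-word IDs and topic assignments on a trained Document), mdl.docs, and mdl.vocabs are assumptions about tomotopy's API and should be checked against the version you use; the -1 handling assumes words excluded from training get topic -1.

import numpy as np
import tomotopy as tp

mdl = tp.LDAModel(k=20)
# ... add_doc() and train() as usual ...

# counts[z, w] = number of times word w is assigned to topic z
counts = np.zeros((mdl.k, len(mdl.vocabs)), dtype=np.int64)
for doc in mdl.docs:
    for word_id, topic_id in zip(doc.words, doc.topics):
        if topic_id >= 0:  # skip words excluded from the model
            counts[topic_id, word_id] += 1

# Normalizing over the topic axis gives each word's vector in topic space,
# without the Dirichlet smoothing applied by get_topic_word_dist.
word_topic_vecs = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)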
Hi,
I was wondering if there's a way to get word embedding vectors in topic space after training a tomotopy LDA model?
Thank you for your amazing work~