bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Is there a way to get topic vector? #163

Open Sixy1204 opened 2 years ago

Sixy1204 commented 2 years ago

Hi,

I was wondering if there's a way to get word embedding vectors in topic space after training a tomotopy LDA model?

Thank you for your amazing work~

bab2min commented 2 years ago

Hello @Sixy1204 What do you mean by word embedding vectors in topic space? After training, an LDA model produces document-topic distributions and topic-word distributions. If you want the topic distribution for each word, you can get it as follows:

import numpy as np
import tomotopy as tp

mdl = tp.LDAModel(...)

# add_doc and train...

topic_word_dists = np.array([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
# now `topic_word_dists` is in the shape [k, v] where k is # of topics and v is # of vocabs.
words = mdl.used_vocabs
for i, word in enumerate(words):
    print(word, topic_word_dists[:, i])
# print topic weights of each word.
# n.b. The sum of the topic weights for each word does not equal 1.
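If you only need the vector for a few specific words, a lookup table is handier than scanning the whole vocabulary. Here is a minimal sketch on top of the snippet above; the word "model" is just a placeholder for whatever term you are interested in:

# sketch: look up the topic-weight vector of a single word
# ("model" is only a placeholder word, not from the original reply)
word2idx = {word: i for i, word in enumerate(mdl.used_vocabs)}
target = "model"
if target in word2idx:
    print(target, topic_word_dists[:, word2idx[target]])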
Sixy1204 commented 2 years ago

@bab2min I'm sorry, I didn't make myself clear. I read a paper that describes how to compute a vector representation of a topic word, as shown in the screenshot below. [screenshot of the formula from the paper] Could you please tell me where I can find the number of times that word w is assigned to topic z during training? Thanks a lot

Sixy1204 commented 2 years ago

I ran the code above, but the word probabilities look really small.

bab2min commented 2 years ago

@Sixy1204 You can normalize the values so that they sum to 1 as follows:

topic_word_dists = np.array([mdl.get_topic_word_dist(k, normalize=False) for k in range(mdl.k)]) # turn off normalizing over word axis
topic_word_dists = topic_word_dists / topic_word_dists.sum(axis=0, keepdims=True) # normalize over topic axis

But it may differ slightly from the value suggested by the paper, because LDAModel.get_topic_word_dist yields smoothed values of the Dirichlet distribution (the prior is added to the raw counts). If you want the exact values, you should count the topic-word matrix manually.
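
A minimal sketch of that manual count, assuming the per-token assignments that each trained document exposes as doc.words and doc.topics (with word IDs indexing mdl.vocabs), and assuming tokens excluded from training carry a negative topic id:

# sketch only: count how many times each word is assigned to each topic,
# using the per-token assignments stored on each trained document
# (doc.words / doc.topics as described above are assumptions about the Document API)
counts = np.zeros((mdl.k, len(mdl.vocabs)), dtype=np.int64)
for doc in mdl.docs:
    for word_id, topic_id in zip(doc.words, doc.topics):
        if topic_id >= 0:  # skip tokens that were excluded from training
            counts[topic_id, word_id] += 1

# normalize over the topic axis to get each word's (unsmoothed) topic vector
word_topic_vecs = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)

Columns for words that never received an assignment stay all-zero; np.maximum only guards against dividing by zero for those columns.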