bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Federated LDA #157

Closed · mschweitzer96 closed this issue 1 year ago

mschweitzer96 commented 2 years ago

Hi,

I want to use the tomotopy package to build a federated topic modeling prototype. I am training local LDA models on different clients, aggregating the topic-word distributions of all parties in a master model, and broadcasting the aggregated topic-word distributions back to the clients for the next iteration.

In the next iteration I want to train the LDA models on the different parties based on the topic-word distributions that were broadcast by the master. I can't find any way to initialize an LDA model from a given topic-word distribution with tomotopy.
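For context, here is a minimal sketch of the aggregation step I have in mind on the master side. It assumes every client ships a (num_topics, num_words) matrix over a shared, fixed vocabulary; the function name and the document-count weighting are just illustrative, not part of tomotopy.

import numpy as np

def aggregate_topic_word(client_matrices, client_doc_counts):
    # weight each client's topic-word distribution by its number of documents
    weights = np.asarray(client_doc_counts, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(client_matrices)          # shape (n_clients, k, v)
    merged = np.tensordot(weights, stacked, 1)   # weighted average, shape (k, v)
    # renormalize rows so each topic is a proper distribution again
    return merged / merged.sum(axis=1, keepdims=True)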

In gensim, if I am not mistaken, I can initialize eta before training with a topic-word matrix of shape (num_topics, num_words) (https://radimrehurek.com/gensim/models/ldamodel.html).
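For reference, a minimal sketch of what that looks like in gensim; the toy corpus and the 0.01 prior value are just placeholders:

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["federated", "topic", "model"], ["topic", "word", "prior"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

num_topics = 2
# topic-word prior of shape (num_topics, num_words), e.g. broadcast from the master
eta = np.full((num_topics, len(dictionary)), 0.01)

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, eta=eta)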

Is there any way to do something like this with tomotopy?

bab2min commented 2 years ago

Hello, @mschweitzer96. Currently, tomotopy's LDAModel doesn't have the feature you asked about, but you can simulate it manually using the set_word_prior method.

import numpy as np
import tomotopy as tp

topic_word_mat = np.array(...)  # matrix of shape (k, v) where k is num_topics, v is num_words
model = tp.LDAModel(...)

# add docs into the model using model.add_corpus() or model.add_doc()

eta = 1e-3  # small smoothing value so no prior is exactly zero
for i, word in enumerate(model.used_vocabs):
    model.set_word_prior(word, (topic_word_mat[:, i] + eta).tolist())
    # you may need to add a small value to the word prior for smoothing if it contains zeros

It is important to note that the order of the vocabulary in used_vocabs depends on the order in which documents were added. So be careful that the correct prior is set for each word when you call set_word_prior.
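As a rough safeguard against such ordering mismatches, building on the snippet above, you could key the broadcast prior by word string and look each word up by name instead of by position. Here shared_vocab is an assumed, fixed word list that all parties agree on, with topic_word_mat columns following that ordering:

word_to_col = {w: i for i, w in enumerate(shared_vocab)}
eta = 1e-3  # small smoothing constant so no prior is exactly zero

for word in model.used_vocabs:
    col = word_to_col.get(word)
    if col is not None:
        model.set_word_prior(word, (topic_word_mat[:, col] + eta).tolist())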

If the document set is fixed, a feature to initialize the LDAModel from a topic-word matrix seems quite useful. It doesn't integrate seamlessly with the existing implementation, but I'll see if there's a way to add that feature.