Closed xinlux97 closed 4 years ago
Hi! Contextualized Topic Models are based on either ProdLDA or LDA. If you already know how LDA works, you can obtain the probability of a word belonging to a given topic by inspecting the topic-word matrix. ProdLDA, on the other hand, does not constrain the topic-word matrix to be a probability distribution: each cell contains just a real-valued weight for word w under topic t.
We just added the method ctm.get_topic_word_matrix() so that you can easily access the topic-word matrix of a model (whether ProdLDA- or LDA-based). The returned matrix has dimensions (number of topics, number of words in the vocabulary), and the index of each column corresponds to the index of the word in the vocabulary.
In practice, you can follow these instructions to get the topic-word matrix:
# train the model
ctm = CTM(input_size=len(handler.vocab), bert_input_size=512, ...)
# fit the model
ctm.fit(training_dataset)
# get topic-word matrix
ctm.get_topic_word_matrix()
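To turn that matrix into readable topics, you can rank each row's weights. Here is a minimal numpy sketch on a dummy matrix — in practice the matrix would come from ctm.get_topic_word_matrix() and the index-to-word mapping from handler.idx2token:

```python
import numpy as np

# Dummy stand-ins for ctm.get_topic_word_matrix() and handler.idx2token
idx2token = {0: "game", 1: "team", 2: "market", 3: "stock"}
topic_word_matrix = np.array([
    [2.1, 1.7, -0.3, -0.9],   # topic 0 leans toward sports words
    [-1.2, -0.5, 1.9, 2.4],   # topic 1 leans toward finance words
])

def top_words(matrix, idx2token, k=2):
    """Return the k highest-weighted words for each topic."""
    return [
        [idx2token[i] for i in np.argsort(row)[::-1][:k]]
        for row in matrix
    ]

print(top_words(topic_word_matrix, idx2token))
```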
Please keep in mind that, if you use ProdLDA, the matrix will not be normalized. Based on the original ProdLDA paper [1] and on our own experiments, we suggest using ProdLDA because it produces more coherent results.
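If you do need comparable per-topic distributions from ProdLDA's unnormalized weights, one option (this is a workaround, not something the library does for you) is to softmax each row yourself:

```python
import numpy as np

def normalize_rows(matrix):
    """Softmax each topic's weight row into a probability distribution.
    One reasonable way to make unnormalized ProdLDA weights comparable;
    not something the library performs for you."""
    shifted = matrix - matrix.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

weights = np.array([[2.0, 0.5, -1.0],
                    [0.0, 1.0, 1.0]])
probs = normalize_rows(weights)
print(probs)  # each row now sums to 1
```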
Hope you find it useful ;) Best,
Silvia
[1] Akash Srivastava and Charles Sutton, Autoencoding Variational Inference for Topic Models (2017)
Thanks very much for your perfect explanation!
I have 2 more questions:
Here is part of the code in the readme instructions for training a contextualized topic model:
handler = TextHandler("documents.txt")
handler.prepare() # create vocabulary and training data
training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")
training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)
ctm = CTM(input_size=len(handler.vocab), bert_input_size=512, inference_type="combined", n_components=50)
ctm.fit(training_dataset) # run the model
(1) I want to train a model on my own dataset, which consists of several files, each containing one article. I processed my dataset into a single file, "documents.txt", so that each line of the file is one article. Is this the right format for the "documents.txt" in the code above?
(2)
The returned matrix has dimensions (number of topics, number of words in the vocabulary) and the index of each column corresponds to the index of the word in the vocabulary.
Here, "the vocabulary" means handler.vocab, right?
Hello!
the answer is yes to both questions :)
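For what it's worth, merging a folder of one-article-per-file texts into that one-article-per-line format takes only a few lines; build_corpus_file below is a hypothetical helper, not part of the library:

```python
from pathlib import Path

def build_corpus_file(article_dir, out_path="documents.txt"):
    """Merge a folder of single-article .txt files into one corpus file
    with one article per line (hypothetical helper, not part of the
    contextualized-topic-models library)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(article_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # collapse internal whitespace and newlines so each article
            # occupies exactly one line in the output
            out.write(" ".join(text.split()) + "\n")
```

The resulting file can then be passed to TextHandler and bert_embeddings_from_file as in the snippet above.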
I'm also adding another note, since this information is missing from our documentation (I'll update it). If you are experimenting with English data, you might be better off using bert-base-nli-mean-tokens instead of the multilingual distiluse-base-multilingual-cased.
In that case, you should update the code as follows:
training_bert = bert_embeddings_from_file("documents.txt", "bert-base-nli-mean-tokens")
ctm = CTM(input_size=len(handler.vocab), bert_input_size=768, inference_type="combined", n_components=50)
(these BERT embeddings are 768-dimensional)
In general, the code should support all the models described in the sentence-transformers package.
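One way to avoid dimension mismatches when switching models is to keep the model-to-dimension mapping explicit. The dimensions below are taken from the sentence-transformers model cards; bert_input_size_for is a hypothetical helper, not part of the library:

```python
# bert_input_size must match the output width of the sentence-transformers
# model passed to bert_embeddings_from_file. Dimensions from the model cards:
EMBEDDING_DIMS = {
    "distiluse-base-multilingual-cased": 512,
    "bert-base-nli-mean-tokens": 768,
}

def bert_input_size_for(model_name):
    """Look up the embedding width for a known model (hypothetical helper)."""
    try:
        return EMBEDDING_DIMS[model_name]
    except KeyError:
        raise ValueError(f"unknown model {model_name!r}; check its model card")
```

If you have sentence-transformers installed, SentenceTransformer(model_name).get_sentence_embedding_dimension() reports the same number without hard-coding.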
I'll close the issue now, if you have another question ping us again :)
Hi, in the example notebook for Contextualized Topic Modeling, we can get the topic distribution for a document via
distribution = ctm.get_thetas(training_dataset)[8] # topic distribution for the document at index 8
I'm wondering whether we can get the topic distribution for a single word. More specifically, for an input text corpus, can we obtain a vocabulary file that contains the words and their corresponding topic representations?
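For context, one way to read a per-word topic representation off the topic-word matrix discussed earlier in this thread is to take the word's column and normalize it over topics. The sketch below uses a dummy matrix, and the softmax normalization is an assumption for unnormalized ProdLDA weights, not anything the library prescribes:

```python
import numpy as np

# Dummy stand-ins: vocab plays the role of handler.vocab, and matrix the
# role of the (n_topics, vocab_size) output of ctm.get_topic_word_matrix().
vocab = ["game", "team", "market"]
matrix = np.array([[2.1, 1.7, -0.3],
                   [-1.2, -0.5, 1.9]])

def word_topic_distribution(word):
    """Normalize a word's column of topic weights into a distribution
    over topics (softmax, assuming unnormalized ProdLDA weights)."""
    col = matrix[:, vocab.index(word)]   # the word's weight under each topic
    exp = np.exp(col - col.max())        # shift for numerical stability
    return exp / exp.sum()

dist = word_topic_distribution("market")
print(dist)  # sums to 1; largest entry marks the word's dominant topic
```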