MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.21k stars 147 forks

Checking the documents inside of the topic #104

Closed StrangeFate closed 2 years ago

StrangeFate commented 2 years ago

Hi. I have a question about the documents in the topic model.

Basically, I'd like to know whether there is a way to extract the original documents (the preprocessed ones are fine if the unpreprocessed ones can't be extracted) from a trained topic model.

For example, let's say there are thousands of documents and we run topic modeling on them. The result includes a topic A, and I want to see all the documents assigned to topic A, so that I can understand the topic more deeply than I can from its keywords alone.

Thanks for reading!

vinid commented 2 years ago

Hello!

You can use the get_predicted_topics method to do this :) You should get a list with the predicted topic for each document.

StrangeFate commented 2 years ago

Thank you @vinid

I just tried what you suggested and I'm a little confused.

The description of get_predicted_topics says it needs a CTM dataset as input. What is a CTM dataset in this context?

Also, I tried passing training_dataset to get_predicted_topics, and I think it worked, since it returns topic numbers. In that case, does the order of the returned topic numbers match the indices of preprocessed_documents (and unpreprocessed_documents) directly?

Thanks for the quick reply and good work!

vinid commented 2 years ago

Exactly! You should pass the data you used to train the model (training_dataset) to it. You can then align the result with the original dataset you have :)
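
The alignment step might look roughly like the sketch below. It is not code from the library: `group_documents_by_topic` is a hypothetical helper, and `predicted_topics` is a hand-written stand-in for the list that get_predicted_topics returns for training_dataset. It assumes the document list has the same length and order as the dataset used for training.

```python
from collections import defaultdict


def group_documents_by_topic(documents, predicted_topics):
    """Group documents by their predicted topic id.

    Hypothetical helper: assumes documents[i] is the text behind the
    i-th entry of the training dataset, so that predicted_topics[i]
    is the topic predicted for documents[i].
    """
    if len(documents) != len(predicted_topics):
        raise ValueError(
            "documents and predicted_topics must have the same length; "
            "check that no documents were dropped during preprocessing"
        )
    grouped = defaultdict(list)
    for doc, topic in zip(documents, predicted_topics):
        grouped[int(topic)].append(doc)
    return dict(grouped)


# Stand-in data for illustration only. In practice, documents would be
# the list used to build training_dataset, and predicted_topics would be
# the output of get_predicted_topics on that dataset.
documents = [
    "the cat sat on the mat",
    "stocks fell sharply today",
    "dogs and cats make good pets",
    "the central bank raised rates",
]
predicted_topics = [0, 1, 0, 1]

docs_by_topic = group_documents_by_topic(documents, predicted_topics)

# All documents assigned to topic 0
for doc in docs_by_topic[0]:
    print(doc)
```

The length check is there because the positional match only holds if every document that went into the dataset is still in the list, in the same order.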

StrangeFate commented 2 years ago

Thank you! This works very well!

Closing issue!