MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars, 763 forks

Use BertTopic for small number of articles #1018

Closed rudra0713 closed 1 year ago

rudra0713 commented 1 year ago

Hi, I am trying to use BERTopic on a small number of documents (roughly 20-30). These articles discuss different aspects of vaccinations/animal testing. Most of the time, the model returns zero topics (it returns -1 as the topic_id, which I believe is the outlier topic according to the documentation). Is there a minimum number of documents that I have to feed to BERTopic before it can find meaningful topics?

MaartenGr commented 1 year ago

With such a small number of documents, there are a couple of things you can do. First, if you do not plan on using the topic model for prediction, it might be worthwhile to label the articles yourself and pass them to BERTopic. This is called manual topic modeling. Second, you can use a cluster model that does not generate outliers, such as k-Means. BERTopic allows, in theory, any cluster model to be used, so playing around with that might help. Third, if you are generating topic representations from such a small sample, then it might be worthwhile to use an additional representation model, like KeyBERTInspired, that handles smaller documents a bit better.
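A minimal sketch of the second and third suggestions, assuming a recent BERTopic release in which `hdbscan_model` accepts any scikit-learn style clusterer and `bertopic.representation.KeyBERTInspired` is available. The helper names `choose_n_clusters` and `build_small_corpus_model`, the "about five documents per topic" heuristic, and the swap of UMAP for PCA are illustrative assumptions, not part of BERTopic or of the advice above.

```python
def choose_n_clusters(n_docs: int, docs_per_topic: int = 5) -> int:
    """Hypothetical heuristic: aim for roughly `docs_per_topic` documents per cluster."""
    if n_docs < 2:
        raise ValueError("need at least two documents to cluster")
    return max(2, n_docs // docs_per_topic)


def build_small_corpus_model(n_docs: int):
    """Build a BERTopic model that assigns every document to a topic (no -1 outliers)."""
    # Imported inside the function so the pure helper above works
    # even when BERTopic and scikit-learn are not installed.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # k-Means puts every document in some cluster, so there is no outlier topic.
    cluster_model = KMeans(
        n_clusters=choose_n_clusters(n_docs), n_init=10, random_state=42
    )
    # Assumption: PCA replaces UMAP here because it has no neighbourhood-size
    # parameter to tune for a tiny corpus. This is a sketch choice only.
    dim_model = PCA(n_components=min(5, n_docs - 1))

    return BERTopic(
        umap_model=dim_model,
        hdbscan_model=cluster_model,  # the parameter name is historical
        representation_model=KeyBERTInspired(),
    )


# Usage, with `docs` being your own list of 20-30 article texts:
#   topic_model = build_small_corpus_model(len(docs))
#   topics, _ = topic_model.fit_transform(docs)
#   print(topic_model.get_topic_info())
```

Because k-Means needs the number of clusters up front, the result depends heavily on that guess, so it may be worth trying a few values and reading the resulting topics.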

rudra0713 commented 1 year ago

Thanks for your comment @MaartenGr. In my work, I may get 10-30 documents related to a variety of topics such as climate change, vaccination, college education, etc. My goal is to use a topic modeling approach (like BERTopic) to group documents that share some topics (my plan was to use functions like get_document_info() or get_representative_docs()). Since the topics are not predetermined, I do not think manual labeling will be of much help in my scenario.

I do not mind getting the outlier topic as long as I also get useful topic representations.

I tried using the KeyBERTInspired model; however, it did not help.

MaartenGr commented 1 year ago

With few documents, I would personally then opt for clustering algorithms that do not create outliers at all. Since you have very little information available, almost anything that can be added to inform the topic is often helpful. If you do not expect to have more than 30 documents at once, then manual topic modeling might be better suited, as it takes much less time than figuring out which algorithm works best.
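For reference, manual topic modeling can be sketched roughly as follows, assuming a BERTopic version that ships the placeholder classes `BaseEmbedder`, `BaseCluster`, and `BaseDimensionalityReduction`. The helpers `encode_labels` and `fit_manual_topics` are hypothetical names introduced for illustration. The idea is to skip embedding, dimensionality reduction, and clustering entirely, and let BERTopic build topic representations only from the labels you supply.

```python
def encode_labels(labels):
    """Map hand-assigned label strings to integer ids, in order of first appearance."""
    mapping = {}
    ids = [mapping.setdefault(label, len(mapping)) for label in labels]
    return ids, mapping


def fit_manual_topics(docs, labels):
    """Fit BERTopic on hand-labelled documents; returns (model, topic id -> label)."""
    # Imported lazily so `encode_labels` is usable without BERTopic installed.
    from bertopic import BERTopic
    from bertopic.backend import BaseEmbedder
    from bertopic.cluster import BaseCluster
    from bertopic.dimensionality import BaseDimensionalityReduction

    y, mapping = encode_labels(labels)

    # Placeholder components: no embeddings, no reduction, no clustering.
    topic_model = BERTopic(
        embedding_model=BaseEmbedder(),
        umap_model=BaseDimensionalityReduction(),
        hdbscan_model=BaseCluster(),
    )
    topic_model.fit_transform(docs, y=y)

    # BERTopic re-numbers topics by size after fitting, so translate our ids
    # through its internal mapping before attaching the label names.
    id_to_label = {topic_id: name for name, topic_id in mapping.items()}
    remapped = topic_model.topic_mapper_.get_mappings()
    labels_by_topic = {
        new: id_to_label[old] for old, new in remapped.items() if old in id_to_label
    }
    return topic_model, labels_by_topic
```

With 30 documents, assigning the labels by hand is a few minutes of work, and the resulting model still supports `get_topic_info()` and related inspection calls.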

drob-xx commented 1 year ago

@rudra0713 You might check out TopicTuner, which lets you quickly run all the possible combinations of HDBSCAN's min_cluster_size and min_samples parameters so that you can determine whether any of them work for your use case.
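The idea of such a sweep can be sketched with a plain loop over HDBSCAN's `min_cluster_size` and `min_samples` parameters. This is not TopicTuner's API. It assumes the `hdbscan` package and NumPy are installed, and the helper names `hdbscan_grid` and `sweep` are hypothetical. Restricting `min_samples` to values no larger than `min_cluster_size` is a choice made for this sketch, not an HDBSCAN requirement.

```python
from itertools import product


def hdbscan_grid(n_docs, max_cluster_size=None):
    """List the (min_cluster_size, min_samples) pairs to try for `n_docs` documents."""
    # Assumption: clusters larger than half the corpus are not worth searching for.
    upper = max_cluster_size or max(2, n_docs // 2)
    sizes = range(2, upper + 1)  # HDBSCAN needs min_cluster_size >= 2
    samples = range(1, upper + 1)
    return [(mcs, ms) for mcs, ms in product(sizes, samples) if ms <= mcs]


def sweep(embeddings):
    """Run HDBSCAN for every grid point; returns (mcs, ms, n_clusters, n_outliers) rows."""
    # Imported lazily so `hdbscan_grid` is usable without hdbscan installed.
    import hdbscan

    results = []
    for mcs, ms in hdbscan_grid(len(embeddings)):
        labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(
            embeddings
        )
        n_clusters = len(set(labels) - {-1})    # -1 marks outliers
        n_outliers = int((labels == -1).sum())
        results.append((mcs, ms, n_clusters, n_outliers))
    return results


# Usage, with `reduced` being a NumPy array of reduced document embeddings:
#   for row in sweep(reduced):
#       print(row)
```

Rows with at least two clusters and few outliers are the candidates worth passing back to BERTopic via a configured `hdbscan_model`.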

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!