Closed rudra0713 closed 1 year ago
With such a small number of documents, there are a couple of things you can do. First, if you do not plan on using the topic model for prediction, it might be worthwhile to label the articles yourself and pass the labels to BERTopic. This is called manual topic modeling. Second, you can use a cluster model that does not generate outliers, such as k-Means. BERTopic allows, in theory, any cluster model to be used, so playing around with that might help. Third, if you are generating topic representations from such a small sample, it might be worthwhile to use an additional representation model, like KeyBERTInspired, which handles smaller documents a bit better.
Thanks for your comment @MaartenGr.
In my work, I may get 10-30 documents related to a variety of topics such as climate change, vaccination, college education, etc. My goal is to use a topic modeling approach (like BERTopic) to group documents that share topics. My plan was to use functions like get_document_info() or get_representative_docs(). Since the topics are not predetermined, I do not think manual labeling will be of much help in my scenario.
I do not mind getting the outlier topic as long as I also get useful topic representations.
I tried using the KeyBERTInspired model, but it did not help.
With so few documents, I would personally opt for clustering algorithms that do not create outliers at all. Since you have very little information available, almost anything that can be added to inform the topics is often helpful. If you do not expect to have more than 30 documents at once, then manual topic modeling might be better suited, as it takes much less time than figuring out which algorithm works best.
@rudra0713 You might check out TopicTuner, which lets you quickly run many combinations of the HDBSCAN min_cluster_size and min_samples parameters so you can determine whether any of them work for your use case.
Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!
Hi, I am trying to use BERTopic on a small number of documents (say, ~20-30). These articles discuss different aspects of vaccinations/animal testing. Most of the time, the model returns zero topics (it returns -1 as the topic_id, which I believe is the outlier topic according to the documentation). Is there a minimum number of documents that I have to feed to BERTopic before it can find meaningful topics?