MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

Some problems when using topic_model.reduce_topics #1155

Closed scn0901 closed 1 year ago

scn0901 commented 1 year ago
  1. If a specific number of topics is passed to nr_topics, how can I determine that the chosen number is reasonable? Is there a metric to evaluate model quality across different numbers of topics, or can the number of topics only be judged from the actual use case?
  2. If "auto" is passed in nr_topics, what should I do if there are still many remaining topics (originally: 151, now 102)? Is it possible to use "auto" repeatedly, that is to say topic_model.reduce_topics(docs, nr_topics='auto').reduce_topics(docs, nr_topics='auto').reduce_topics(docs, nr_topics='auto')...? If so, is it OK to run until the number of topics no longer decreases?
  3. Which method do you recommend for reducing topics: specifying the number of topics, or "auto"?
  4. When I reduced many topics, I found that the new topic 0 contained a large share of the documents, which is not what I want. How can I avoid a large number of documents being clustered into topic 0 after topic reduction?

Thank you for your reply!

MaartenGr commented 1 year ago

If a specific number of topics is passed to nr_topics, how can I determine that the chosen number is reasonable? Is there a metric to evaluate model quality across different numbers of topics, or can the number of topics only be judged from the actual use case?

There are many diverse evaluation metrics for this but in essence, it depends on your use case. Some metrics, like topic coherence, are optimized for interpretability, whereas others, like clustering metrics, focus more on the structure of the clusters themselves. The key here is defining what is important for your use case and then translating that into a metric. OCTIS contains a number of interesting metrics.
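To make "topic coherence" concrete, here is a minimal, self-contained sketch of an NPMI-style coherence score for one topic's top words, computed from document co-occurrence. This is an illustration of the idea only, not OCTIS's or BERTopic's implementation; in practice you would use a library such as OCTIS or gensim.

```python
from itertools import combinations
from math import log

def topic_npmi(top_words, tokenized_docs, eps=1e-12):
    """Average pairwise NPMI of a topic's top words over a corpus.

    Higher values mean the words co-occur more often than chance,
    which is a common proxy for topic interpretability.
    """
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    # Document-level probability of each top word.
    p = {w: sum(w in d for d in doc_sets) / n_docs for w in top_words}
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: worst score
            continue
        pmi = log(p12 / (p[w1] * p[w2] + eps))
        scores.append(pmi / -log(p12 + eps))  # normalize PMI to [-1, 1]
    return sum(scores) / len(scores)

# Toy corpus: "cat" and "dog" always co-occur, "fish" appears elsewhere.
docs = [["cat", "dog"], ["cat", "dog"], ["fish", "bird"]]
print(topic_npmi(["cat", "dog"], docs))   # close to 1.0 (coherent)
print(topic_npmi(["cat", "fish"], docs))  # -1.0 (never co-occur)
```

A score like this can then be compared across models trained (or reduced) with different numbers of topics.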

If "auto" is passed in nr_topics, what should I do if there are still many remaining topics (originally: 151, now 102)? Is it possible to use "auto" repeatedly, that is to say topic_model.reduce_topics(docs, nr_topics='auto').reduce_topics(docs, nr_topics='auto').reduce_topics(docs, nr_topics='auto')...? If so, is it OK to run until the number of topics no longer decreases? Which method do you recommend for reducing topics, specifying the number of topics or "auto"?

You could do that, but I would advise focusing on the clustering algorithm itself, as that is where the original number of topics is selected. For example, if you are using HDBSCAN, you can use min_cluster_size to control the minimum size of a topic.
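If you do experiment with the repeated-"auto" idea, "run until the number of topics no longer decreases" is just a fixed-point loop, which can be sketched generically. In this sketch, reduce_once is a hypothetical stand-in for one call to topic_model.reduce_topics(docs, nr_topics='auto') followed by reading the new topic count; it is not BERTopic's API.

```python
def reduce_until_stable(n_topics, reduce_once, max_rounds=10):
    """Apply a reduction step repeatedly until the topic count stops
    shrinking (a fixed point) or max_rounds is reached.

    reduce_once: callable taking the current topic count and returning
    the count after one reduction pass.
    """
    for _ in range(max_rounds):
        new_n = reduce_once(n_topics)
        if new_n >= n_topics:  # no further reduction: we are stable
            return n_topics
        n_topics = new_n
    return n_topics

# Toy reducer: merges roughly a third of the topics each pass but
# never goes below 40 (hypothetical numbers, for illustration only).
print(reduce_until_stable(151, lambda n: max(40, n - n // 3)))  # 40
```

The max_rounds guard matters: without it, a reducer that oscillates would loop forever.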

Which method do you recommend for reducing topics: specifying the number of topics, or "auto"?

I would actually start with the clustering algorithm itself and tune it until the resulting topics are satisfying to me.
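Concretely, tuning the clustering step rather than reducing afterwards might look like the configuration sketch below. The parameter values are illustrative assumptions, not recommendations; tune them on your own corpus.

```python
from hdbscan import HDBSCAN
from bertopic import BERTopic

# A larger min_cluster_size yields fewer, bigger topics out of the box,
# which can remove the need for aggressive topic reduction afterwards.
hdbscan_model = HDBSCAN(
    min_cluster_size=50,  # illustrative value; tune per corpus
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```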

When I reduced many topics, I found that the new topic 0 contained a large share of the documents, which is not what I want. How can I avoid a large number of documents being clustered into topic 0 after topic reduction?

You could change reduction strategies and perform the reduction manually, but it might be worthwhile to focus on the clustering algorithm first.
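If you do go the manual route, one possible strategy (my own sketch, not BERTopic's built-in reduction) is to fold each small topic into its most similar large topic, e.g. by cosine similarity of their topic vectors, instead of letting everything collapse into one big topic. The resulting pairs are in the shape that topic_model.merge_topics(docs, topics_to_merge) accepts.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def plan_merges(topic_vectors, topic_sizes, min_size):
    """Pair each topic smaller than min_size with its most similar
    large topic. Returns [[small_id, target_id], ...] pairs.
    """
    small = [t for t, s in topic_sizes.items() if s < min_size]
    large = [t for t, s in topic_sizes.items() if s >= min_size]
    merges = []
    for t in small:
        target = max(large, key=lambda u: cosine(topic_vectors[t],
                                                 topic_vectors[u]))
        merges.append([t, target])
    return merges

# Toy example: topic 2 is tiny and its vector is closest to topic 0's.
vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.9, 0.1]}
sizes = {0: 500, 1: 300, 2: 12}
print(plan_merges(vecs, sizes, min_size=50))  # [[2, 0]]
```

Because each small topic is merged into its nearest neighbor rather than one catch-all cluster, no single topic has to absorb all the reduced documents.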

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!