A question about the number of topic results generated without setting nr_topics

MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

https://maartengr.github.io/BERTopic/

MIT License


A question about the number of topic results generated without setting nr_topics #1249

Closed aligagag closed 1 year ago

aligagag commented 1 year ago

Hello MaartenGr, I did not set the nr_topics parameter when using BERTopic to process my data (30,000 entries). In the end, 512 topics were obtained, but a lot of the data (10,000 items) was assigned to the topic labeled -1. However, manual inspection shows that many of the documents labeled -1 actually belong to other topics. I would like to ask: which parameters can be adjusted to solve this problem? Or rather, which steps or parameters determine the final number of topics when nr_topics is not set?

Rosie2023Rosie commented 1 year ago

The same happened to me: I had about 30K documents in total, and the outliers made up approximately 50% of the data. When I randomly checked some of the outlier documents, I found many related documents. Any suggestions would help.

MaartenGr commented 1 year ago

To reduce the number of outliers, I would refer you to the FAQ. It mentions three strategies that you can use for reducing outliers in BERTopic: adjusting the HDBSCAN parameters, using .reduce_outliers in BERTopic, and using a different clustering algorithm.
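For readers landing on this thread, here is a minimal sketch of those three strategies. It assumes a recent BERTopic release that provides `reduce_outliers` and `update_topics`, plus the `hdbscan` and `scikit-learn` packages. The function names and the specific values (`min_cluster_size=15`, `min_samples=5`, `n_clusters=50`) are illustrative assumptions, not recommendations from the FAQ. The library imports sit inside the functions so that the small `outlier_share` helper can be used on its own.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


def fit_with_tuned_hdbscan(docs, min_cluster_size=15, min_samples=5):
    """Strategy 1: pass a custom HDBSCAN model to BERTopic.

    min_cluster_size controls how small a topic may be, so it also drives
    the number of topics when nr_topics is not set. A min_samples value
    lower than min_cluster_size makes HDBSCAN less conservative, which
    usually leaves fewer documents in -1. The values here are illustrative.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    topic_model = BERTopic(hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def reassign_outliers(topic_model, docs, topics):
    """Strategy 2: reassign -1 documents after fitting.

    update_topics refreshes the topic representations so that they
    reflect the new assignments.
    """
    new_topics = topic_model.reduce_outliers(docs, topics)
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics


def fit_with_kmeans(docs, n_clusters=50):
    """Strategy 3: use a clustering algorithm without an outlier class.

    k-Means assigns every document to a cluster, so no -1 topic is
    produced, at the cost of choosing n_clusters up front.
    """
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=n_clusters))
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

A possible workflow is to fit once, check `outlier_share(topics)`, then call `reassign_outliers` and compare the share again. On the second question above: without `nr_topics`, the number of topics mainly follows from the minimum cluster size (BERTopic's `min_topic_size`, or `min_cluster_size` when a custom HDBSCAN model is passed), with larger values giving fewer, broader topics.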

aligagag commented 1 year ago

Very, very helpful, thank you very much

Rosie2023Rosie commented 1 year ago

Thanks a lot