MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars, 763 forks

How to Reduce the Number of Documents Classified as -1 Topic in Bertopic Algorithm? #1197

Closed kimkyulim closed 1 year ago

kimkyulim commented 1 year ago

Hello, @MaartenGr

I have been using the BERTopic algorithm and have noticed that the number of documents classified as topic -1 is quite high, ranging from 30% to 50% of all documents. Is there a way to reduce the share of documents classified as topic -1 to around 10%?

Thank you for your answer

MaartenGr commented 1 year ago

There are several strategies for reducing outliers that you can find in the FAQ. There are roughly three:

  1. Use .reduce_outliers
  2. Replace HDBSCAN with a non-outlier algorithm, like k-Means
  3. Fine-tune HDBSCAN's parameters

Personally, I would advise going with .reduce_outliers as it has a number of interesting strategies to use.
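
The three options listed above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: it assumes `bertopic`, `hdbscan`, and `scikit-learn` are installed, the helper names (`outlier_share`, `reduce_and_update`, `kmeans_topic_model`, `tuned_hdbscan_topic_model`) are hypothetical, and the parameter values (`"c-tf-idf"`, `n_clusters=50`, `min_cluster_size=15`, `min_samples=5`) are placeholders to adapt to your data.

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    topics = list(topics)
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)


def reduce_and_update(topic_model, docs, topics, strategy="c-tf-idf"):
    """Option 1: reassign outlier documents on an already fitted model."""
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Refresh the topic representations so they reflect the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics


def kmeans_topic_model(n_clusters=50):
    """Option 2: cluster with k-Means, which assigns every document to a cluster."""
    from bertopic import BERTopic          # imported lazily; assumes bertopic is installed
    from sklearn.cluster import KMeans     # assumes scikit-learn is installed
    return BERTopic(hdbscan_model=KMeans(n_clusters=n_clusters))


def tuned_hdbscan_topic_model(min_cluster_size=15, min_samples=5):
    """Option 3: keep HDBSCAN; a lower min_samples tends to give fewer outliers."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN            # assumes hdbscan is installed
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        prediction_data=True,
    )
    return BERTopic(hdbscan_model=hdbscan_model)
```

Comparing `outlier_share(topics)` before and after `reduce_and_update` shows how much of the -1 topic was reassigned.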

kimkyulim commented 1 year ago

Hi @MaartenGr, based on the advice you gave, I tried reducing the outliers.

There are 4 strategies in .reduce_outliers, so I am trying them all. With the probabilities strategy, however, I get an error:

TypeError: 'numpy.float64' object is not iterable

It seems to occur because something is iterating over a float64 object. Can you help me with this problem? This is my code:

from bertopic import BERTopic

topic_model = BERTopic(calculate_probabilities=True)
topic_model = topic_model.load("load_model")
topics, probs = topic_model.fit_transform(docs)

new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

MaartenGr commented 1 year ago

Could you share the entire output of the error? Also, did you make sure to install the latest version of BERTopic? You can check with from bertopic import __version__ as bt_version; bt_version

kimkyulim commented 1 year ago

Hi @MaartenGr

My BERTopic version is 0.14.1, but the issue is the same.

This is the entire error output:


TypeError                                 Traceback (most recent call last)
in
      9
     10 # Reduce outliers using the `probabilities` strategy
---> 11 new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

C:\ProgramData\Anaconda3\lib\site-packages\bertopic\_bertopic.py in reduce_outliers(self, documents, topics, strategy, probabilities, threshold, embeddings, distributions_params)
   1956         if strategy.lower() == "probabilities":
   1957             new_topics = [np.argmax(prob) if max(prob) >= threshold and topic == -1 else topic
-> 1958                           for topic, prob in zip(topics, probabilities)]
   1959
   1960         # Reduce outliers by extracting most frequent topics through calculating of Topic Distributions

C:\ProgramData\Anaconda3\lib\site-packages\bertopic\_bertopic.py in (.0)
   1956         if strategy.lower() == "probabilities":
   1957             new_topics = [np.argmax(prob) if max(prob) >= threshold and topic == -1 else topic
-> 1958                           for topic, prob in zip(topics, probabilities)]
   1959
   1960         # Reduce outliers by extracting most frequent topics through calculating of Topic Distributions

TypeError: 'numpy.float64' object is not iterable

MaartenGr commented 1 year ago

Strange, could you check the shape of probs? I have a feeling that it is a flat list of probabilities and not the topic-document probability matrix.
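
The check suggested here could look like the sketch below. It mirrors the list comprehension shown in the traceback, but uses plain Python lists in place of NumPy arrays so that it runs without any dependencies. `probs_is_matrix` and `reassign_outliers` are hypothetical names, not BERTopic functions, and the default threshold of 0 is an assumption. With a real model, `numpy.shape(probs)` gives the same information: a two-dimensional shape (documents, topics) is what the `probabilities` strategy needs.

```python
def probs_is_matrix(probs):
    """True if every entry of probs is itself a row of per-topic probabilities."""
    return len(probs) > 0 and all(hasattr(row, "__len__") for row in probs)


def reassign_outliers(topics, probs, threshold=0.0):
    """Give each -1 document its most probable topic, as in the traceback."""
    if not probs_is_matrix(probs):
        raise ValueError(
            "probs must hold one row of topic probabilities per document; "
            "a flat list holds only a single value per document"
        )
    new_topics = []
    for topic, row in zip(topics, probs):
        best = max(range(len(row)), key=lambda i: row[i])  # argmax of the row
        is_reassignable = topic == -1 and row[best] >= threshold
        new_topics.append(best if is_reassignable else topic)
    return new_topics


topics = [-1, 0, -1]
matrix_probs = [[0.1, 0.7], [0.9, 0.05], [0.2, 0.3]]  # 3 documents x 2 topics
flat_probs = [0.7, 0.9, 0.3]                          # one value per document

print(probs_is_matrix(matrix_probs), probs_is_matrix(flat_probs))  # True False
print(reassign_outliers(topics, matrix_probs))                     # [1, 0, 1]
```

One possible cause, not confirmed in this thread: in the snippet above, `load` returns the saved model and replaces the instance created with `calculate_probabilities=True`, so the saved model's own setting applies. If that model was created without `calculate_probabilities=True`, `probs` is flat.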

kimkyulim commented 1 year ago

Hi @MaartenGr

Thank you for your help so far. Unfortunately, the .reduce_outliers method did not work as expected: the topic -1 ratio remained similar to the previous results.

As an alternative, I decided to use k-Means, setting the number of clusters to the number of topics obtained from BERTopic. That seems like a reasonable approach, but I still have a few questions.

Questions:

  1. The BERTopic algorithm uses HDBSCAN to cluster topics, but when you switch to k-Means, you are no longer using HDBSCAN. Instead, you are using k-Means to cluster the document-topic matrix that BERTopic produced. k-Means and HDBSCAN are different clustering algorithms with different assumptions and characteristics. If I use k-Means, can I still say that I used BERTopic?

  2. I'm looking for a way to set n_cluster. Should I choose the number of k-Means clusters with something like the elbow method?

MaartenGr commented 1 year ago

> The BERTopic algorithm uses HDBSCAN to cluster topics, but when you switch to k-Means, you are no longer using HDBSCAN. Instead, you are using k-Means to cluster the document-topic matrix that BERTopic produced. k-Means and HDBSCAN are different clustering algorithms with different assumptions and characteristics. If I use k-Means, can I still say that I used BERTopic?

You can indeed use a clustering algorithm other than HDBSCAN. BERTopic was designed to be modular for the most part, so that users can use whatever they need for their specific use case. You can find more about clustering models here.

> I'm looking for a way to set n_cluster. Should I choose the number of k-Means clusters with something like the elbow method?

That is indeed an option, but it depends on your use case. At times there might be domain experts in your field who can help you with your domain-specific data, or the use case might call for abstract topics. Having said that, human evaluation, meaning looking at the resulting topics yourself and making the judgement, is one of the most important aspects.
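
If the elbow method is used as a starting point, the selection step could be sketched as below. This is an illustration under assumptions, not something from the thread: `pick_elbow` is a hypothetical helper that picks the k with the largest second difference of the inertia curve (it expects evenly spaced candidate values of k), the inertia values are made-up numbers, and `kmeans_inertias` assumes scikit-learn is installed.

```python
def pick_elbow(ks, inertias):
    """Return the k where the inertia curve bends most (largest second difference)."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


def kmeans_inertias(embeddings, ks):
    """Fit k-Means once per candidate k and collect the inertia of each fit."""
    from sklearn.cluster import KMeans  # imported lazily; assumes scikit-learn
    return [
        KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings).inertia_
        for k in ks
    ]


ks = [2, 3, 4, 5, 6]
inertias = [1000.0, 600.0, 300.0, 250.0, 220.0]  # made-up values for illustration
print(pick_elbow(ks, inertias))  # 4
```

The value returned is only a candidate. The resulting topics still need to be inspected by hand, for example by fitting BERTopic with that number of clusters and reading the topic representations.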

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!