kimkyulim closed this issue 1 year ago
There are several strategies for reducing outliers that you can find in the FAQ. There are roughly three:
.reduce_outliers
Personally, I would advise going with .reduce_outliers, as it has a number of interesting strategies to use.
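As a rough sketch of how the strategies could be compared, the helper below calls .reduce_outliers once per strategy and reports how many documents remain in topic -1. The helper names (try_outlier_strategies, outlier_ratio) are made up for illustration. It assumes topic_model is an already fitted BERTopic model and that docs, topics and probs come from its fit_transform call.

```python
def outlier_ratio(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    return sum(1 for t in topics if t == -1) / len(topics)


def try_outlier_strategies(topic_model, docs, topics, probs=None):
    """Run .reduce_outliers with each strategy and collect the new topic lists.

    `topic_model` is assumed to be a fitted BERTopic instance. The
    "probabilities" strategy is only tried when `probs` is given, since it
    needs the document-topic probability matrix.
    """
    results = {}
    for strategy in ("c-tf-idf", "embeddings", "distributions"):
        results[strategy] = topic_model.reduce_outliers(
            docs, topics, strategy=strategy
        )
    if probs is not None:
        results["probabilities"] = topic_model.reduce_outliers(
            docs, topics, probabilities=probs, strategy="probabilities"
        )
    return results


def report(results):
    """Print the remaining outlier ratio for every strategy that was tried."""
    for strategy, new_topics in results.items():
        print(f"{strategy}: {outlier_ratio(new_topics):.1%} outliers left")
```

Comparing the ratios side by side makes it easier to see which strategy moves the most documents out of topic -1 for a given dataset.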
Hi @MaartenGr, based on the advice you gave, I tried removing the outliers.
There are 4 strategies in .reduce_outliers, so I am trying all of them. I'm trying to use the probabilities strategy, but I'm getting an error.
TypeError: 'numpy.float64' object is not iterable
It seems to show up because something is trying to iterate over a float64 object. Can you help me with the problem?
from bertopic import BERTopic
topic_model = BERTopic(calculate_probabilities=True)
topic_model = topic_model.load("load_model")
topics, probs = topic_model.fit_transform(docs)
new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")
Could you share the entire output of the error? Also, did you make sure to install the latest version of BERTopic? You can check with from bertopic import __version__ as bt_version; bt_version
Hi @MaartenGr
My BERTopic version is 0.14.1, but the issue is the same.
TypeError Traceback (most recent call last)
Strange, could you check the shape of probs? I have a feeling that it is a flat list of probabilities and not the topic-document probability matrix.
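A quick way to check this could look like the sketch below. The helper name describe_probs is made up for illustration. It only distinguishes a 2-D document-topic matrix from a flat, one-value-per-document array. Iterating over a single entry of a flat array yields a numpy.float64 scalar, which would be consistent with the TypeError above.

```python
import numpy as np


def describe_probs(probs, n_docs):
    """Classify `probs` as a document-topic matrix, a flat array, or neither."""
    arr = np.asarray(probs)
    if arr.ndim == 2 and arr.shape[0] == n_docs:
        # One row per document, one column per topic.
        return "matrix"
    if arr.ndim == 1 and arr.shape[0] == n_docs:
        # One probability per document only.
        return "flat"
    return "unexpected"


# Illustrative data only: 3 documents, 4 topics.
matrix_probs = np.full((3, 4), 0.25)
flat_probs = np.array([0.9, 0.4, 0.7])

print(np.asarray(matrix_probs).shape, describe_probs(matrix_probs, n_docs=3))
print(np.asarray(flat_probs).shape, describe_probs(flat_probs, n_docs=3))
```

If the result is "flat", the probabilities strategy does not have a per-topic row to work with for each document.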
Hi @MaartenGr
Thank you for letting me know about your progress with the topic modeling methods we discussed. I'm sorry to hear that the .reduce_outliers method did not work as expected and that the topic -1 ratio remained similar to the previous results.
Regarding your decision to use K-means as an alternative, setting the number of clusters to the number of topics obtained from bertopic sounds like a reasonable approach. However, I understand that you have concerns about the topic -1 ratio remaining similar to the previous one.
Questions:
The bertopic algorithm does use hdbscan to cluster topics, but when you switch to K-means, you are no longer using hdbscan. Instead, you are using K-means to cluster the document-topic matrix that bertopic produced. K-means and hdbscan are different clustering algorithms with different assumptions and characteristics. Is it still accurate to say that I used BERTopic when the clustering is done with K-means?
I'm looking for a way to set n_cluster. Should I use something like the elbow method to choose the number of K-means clusters?
The bertopic algorithm does use hdbscan to cluster topics, but when you switch to K-means, you are no longer using hdbscan. Instead, you are using K-means to cluster the document-topic matrix that bertopic produced. K-means and hdbscan are different clustering algorithms with different assumptions and characteristics. Is it still accurate to say that I used BERTopic when the clustering is done with K-means?
You can indeed use a different clustering algorithm besides HDBSCAN. BERTopic was designed to be modular for the most part, so that users can use whatever they need for their specific use case. You can find more about clustering models here.
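As a minimal sketch of that modularity, a scikit-learn KMeans instance can be passed to BERTopic through its hdbscan_model argument. The helper names and the value of 50 clusters are placeholders for illustration, not a recommendation. Because k-means assigns every document to a cluster, a model built this way should not produce a -1 topic.

```python
from sklearn.cluster import KMeans


def make_cluster_model(n_clusters=50, seed=42):
    """Create the k-means model that will replace HDBSCAN."""
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)


def build_topic_model(n_clusters=50):
    """Build a BERTopic model that clusters with k-means instead of HDBSCAN."""
    # Imported here so the sketch can be read and loaded without BERTopic.
    from bertopic import BERTopic

    # The argument keeps the name `hdbscan_model` even for other clusterers.
    return BERTopic(hdbscan_model=make_cluster_model(n_clusters))


# Usage, assuming `docs` is a list of strings:
# topic_model = build_topic_model(n_clusters=50)
# topics, probs = topic_model.fit_transform(docs)
```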
I'm looking for a way to set n_cluster. Should I use something like the elbow method to choose the number of K-means clusters?
That is indeed an option, but it depends on your use case. At times there might be domain experts in your field who can help you out with your domain-specific data, or the use case might be to have abstract topics. Having said that, human evaluation, meaning looking at the resulting topics yourself and making the judgement, is one of the most important aspects.
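For reference, the elbow method mentioned in the question could be sketched as follows. The function name inertia_curve and the synthetic blob data are illustrative stand-ins. In practice the input would be the document embeddings (possibly after dimensionality reduction), and the resulting curve is only a starting point for the manual inspection described above.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(embeddings, k_values, seed=42):
    """Fit k-means for each k and return the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
        inertias.append(float(km.inertia_))
    return inertias


# Synthetic stand-in for embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

k_values = [1, 2, 3, 4, 5, 6]
curve = inertia_curve(data, k_values)
for k, inertia in zip(k_values, curve):
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply. On real text embeddings the bend is often much less pronounced than on this toy data.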
Closing this due to inactivity. Let me know if I need to re-open the issue!
Hello, @MaartenGr
I have been using the BERTopic algorithm and I have noticed that the number of documents classified as topic -1 is quite high, ranging from 30% to 50% of the total documents. I would like to know if there is a way to reduce the share of documents classified as topic -1 to around 10%.
Thank you for your answer