MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

New topics reduce outliers instructions #529

Closed doubianimehdi closed 2 years ago

doubianimehdi commented 2 years ago

Hi Maarten,

Thanks again for your AMAZING work !!!

I have a question regarding this bit of code:

import numpy as np

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

I know that I can use this new_topics variable in topics_over_time instead of topics ...

But topic_model.get_topic_info() is still giving the list of old topics ... how can I update the model to use these new topics? I'm a bit confused ...

Thank you again !

Regards

MaartenGr commented 2 years ago

import numpy as np

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

This code indeed does not change the internal structure of BERTopic. You can see that the topic model was not referenced here. In order to update your model with the new topics, you will have to perform the following:

import pandas as pd
import numpy as np

# Extract new topics
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

# Update the internal topic representation of the topics
# NOTE: You can skip this step if you do not want to change the topic representations
topic_model.update_topics(docs, new_topics)

# Update topic frequencies
documents = pd.DataFrame({"Document": docs, "Topic": new_topics})
topic_model._update_topic_size(documents)

._update_topic_size() has been separated from .update_topics() on purpose, as .update_topics() was initially meant to update only the topic representations and not the corresponding frequencies. Updating both silently might make it less transparent what the model initially produced as output.
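As a quick sanity check (a minimal sketch, assuming the topic_model, docs, and new_topics from the snippet above), the updated frequencies should now show up in .get_topic_info():

# After updating, get_topic_info() should reflect the new assignments,
# including the documents moved to the outlier topic (-1).
print(topic_model.get_topic_info().head())

# The share of documents now labeled as outliers:
print(sum(1 for topic in new_topics if topic == -1) / len(new_topics))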

doubianimehdi commented 2 years ago

Thank you so much !

junwycresta commented 2 years ago

Getting an error when running:

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-90-6d7c55eb4de3> in <module>
      5 #probs
      6 probability_threshold = 0.01
----> 7 new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

<ipython-input-90-6d7c55eb4de3> in <listcomp>(.0)
      5 #probs
      6 probability_threshold = 0.01
----> 7 new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

TypeError: 'numpy.float64' object is not iterable

junwycresta commented 2 years ago

My numpy version is 1.21.6. I get the probs from

topics, probs = topic_model.fit_transform(texts)

MaartenGr commented 2 years ago

I get the probs from

topics, probs = topic_model.fit_transform(texts)

Ah, calculate_probabilities is set when instantiating the model, so you should use the following instead:

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(texts)

The probs that you are using now are flat and only contain the probability of the topic that each document was assigned to. To get the full topic-document probability matrix, we have to set calculate_probabilities=True.
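To illustrate the difference (a minimal sketch, assuming texts is your list of documents):

from bertopic import BERTopic

# With calculate_probabilities=True, probs is a 2D array of shape
# (n_documents, n_topics), so each row can be thresholded with max(prob).
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(texts)
print(probs.shape)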

cb-pratibhasaha commented 2 years ago

Hello, just curious to know why the probability threshold is taken to be 0.01 and not any other value? How is this threshold determined, and does it change with the amount of data? I used this method to reduce outliers in my data, but it ended up putting dissimilar docs into one topic. Do enlighten me.

Thanks!

MaartenGr commented 2 years ago

Just curious to know why the probability threshold is taken to be 0.01 and not any other value?

The probability threshold was arbitrarily chosen and serves only as an example; you should choose a value that fits your own use case.

How is this threshold determined and does it change with the amount of data?

It should not change much with the amount of data, but it might change a bit depending on the number of topics that you have. I would expect, however, that the changes would be very small and not that significant.
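If it helps, here is a minimal sketch (assuming probs is the document-topic probability matrix from calculate_probabilities=True) of how the share of would-be outliers grows with the threshold:

# For each candidate threshold, count the fraction of documents that would be
# reassigned to the outlier topic (-1).
for probability_threshold in (0.001, 0.01, 0.05, 0.1):
    n_outliers = sum(1 for prob in probs if max(prob) < probability_threshold)
    print(probability_threshold, n_outliers / len(probs))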

I used this method to reduce outliers in my data but it ended up putting dissimilar docs into one.

Just a quick note: you can also use k-Means instead if you want to avoid outliers entirely, since k-Means assigns every document to a cluster.

cb-pratibhasaha commented 2 years ago

Hi, I tried using the KMeans clustering method using the following code:

from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP
from hdbscan import HDBSCAN

cluster_model = KMeans(n_clusters=100)
umap_model = UMAP(random_state=42)
topic_model_2 = BERTopic(hdbscan_model=cluster_model, vectorizer_model=vectorizer_model_cb, umap_model=umap_model)
topics, _ = topic_model_2.fit_transform(text)

I am getting the following error:

AttributeError                            Traceback (most recent call last)
/var/folders/0k/13bfm48j1rz0n2szd70546qh0000gp/T/ipykernel_73533/272709035.py in <module>
      6 umap_model = UMAP(random_state=42)
      7 topic_model_2 = BERTopic(hdbscan_model=cluster_model, vectorizer_model=vectorizer_model_cb, umap_model=umap_model)
----> 8 topics, _ = topic_model_2.fit_transform(text)

~/opt/anaconda3/envs/new_venv/lib/python3.7/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    299
    300 # Cluster UMAP embeddings with HDBSCAN
--> 301 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
    302
    303 # Sort and Map Topic IDs by their frequency

~/opt/anaconda3/envs/new_venv/lib/python3.7/site-packages/bertopic/_bertopic.py in _cluster_embeddings(self, umap_embeddings, documents)
   1397 self.hdbscan_model.fit(umap_embeddings)
   1398 documents['Topic'] = self.hdbscan_model.labels_
-> 1399 probabilities = self.hdbscan_model.probabilities_
   1400
   1401 if self.calculate_probabilities:

AttributeError: 'KMeans' object has no attribute 'probabilities_'

Could you suggest some solutions to this?

MaartenGr commented 2 years ago

@cb-pratibhasaha From the looks of it, it seems that you are not using BERTopic v0.10 which was recently released to include models like k-Means. Upgrading BERTopic through pip install --upgrade bertopic should do the trick.
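After upgrading, a quick way to confirm the installed version (a minimal check; BERTopic exposes __version__):

import bertopic

# k-Means and other non-HDBSCAN cluster models require v0.10 or later.
print(bertopic.__version__)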

cb-pratibhasaha commented 2 years ago

This worked. Thank you so much!! Is there any algorithm to figure out the optimal number of clusters for my data, or do I do it by trial and error?

I do want to thank you for your continuous help in answering my questions. Thank you so much!

MaartenGr commented 2 years ago

@cb-pratibhasaha In the context of BERTopic, I would advise starting from the use case first and then exploring from there. In some cases, you might already know the number of topics you expect to find in the data. To an extent, it is indeed some trial and error to figure out the best number of topics for your use case. There are k-Means-specific heuristics like the elbow method, but they might not necessarily apply to this use case.
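For what it's worth, a minimal sketch of the elbow method on the reduced embeddings (the umap_embeddings variable is illustrative; any 2D array of document embeddings works):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-Means for a range of k and record the inertia (within-cluster sum of
# squares); the "elbow" in the curve is a heuristic for a reasonable k.
ks = range(10, 210, 20)
inertias = [KMeans(n_clusters=k, random_state=42).fit(umap_embeddings).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()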

I do want to thank you for your continuous help in answering my questions. Thank you so much!

No problem! Glad I can be of help.

rubypnchl commented 2 years ago


Hi MaartenGr,

Thank you for the great work; I am currently using BERTopic for one of my problems. I am facing issues while updating the topics with the update_topics / _update_topic_size code from your reply above. My main aim is to reduce outliers while keeping the quality of the topics. I am also curious to know: could you suggest a way to calculate the percentage of outliers BERTopic produces (any standard method, or any suggestion from you)?

I am getting the following error: [screenshot of the error attached]

PS: Thank you for the time-saving library!

Best Regards,