MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.11k stars 763 forks

Different number of topics after training on the same dataset. #461

Closed Nadiaghobadi closed 2 years ago

Nadiaghobadi commented 2 years ago

Hi,

I noticed that if I train BERTopic on the same dataset multiple times, the number of topics can range anywhere from 2-3 to 10-11. Sometimes we get a noise cluster and sometimes we don't. Do you recommend any approach to handle the randomness?

Thank you!

MaartenGr commented 2 years ago

Due to the stochastic nature of UMAP, results may differ between runs. You can set a random_state in UMAP but it might affect performance. You can find a bit more about that in the FAQ here.
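
To make that concrete, here is a minimal sketch of pinning UMAP's random_state and passing the model to BERTopic (the other UMAP values below are simply BERTopic's defaults, used here as an assumption):

from bertopic import BERTopic
from umap import UMAP

# Fixing random_state makes UMAP deterministic within the same environment,
# at the cost of disabling some of its parallelism
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model)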

Nadiaghobadi commented 2 years ago

Thank you for the quick response @MaartenGr !

ViktoriaSpaiser commented 2 years ago

Hi Maarten, sorry, but on the same issue. We just noticed that despite setting random_state to achieve reproducibility, we get different outcomes for the same data and same parameters set. Do any specifications in other model components affect the reproducibility? Here is our implementation:

# preprocess and clean data
my_stopwords = frozenset(list(["rt","RT", "&", "amp", "&amp", "http","https", "http://", "https://", "fav", "FAV"]))
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words = my_stopwords, min_df=20)

# do the BERT topic modelling
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

min_clusters = round(len(alltweets) * 0.0017)
hdbscan_model = HDBSCAN(min_cluster_size=min_clusters, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=5)

sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(alltweets)

#run the model
topic_model = BERTopic(nr_topics='auto', umap_model=umap_model, hdbscan_model=hdbscan_model, embedding_model=sentence_model, vectorizer_model=vectorizer_model, low_memory=True, calculate_probabilities=True)

topics, probs = topic_model.fit_transform(alltweets, embeddings)

Run 1: 45 topics
Run 2: 19 topics
Run 3: 66 topics

Many thanks!

MaartenGr commented 2 years ago

@ViktoriaSpaiser I just tried your code out using the 20NewsGroup dataset and I cannot reproduce the issue you are having:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

alltweets = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

# preprocess and clean data
my_stopwords = frozenset(list(["rt","RT", "&", "amp", "&amp", "http","https", "http://", "https://", "fav", "FAV"]))
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words = my_stopwords, min_df=20)

# do the BERT topic modelling
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

min_clusters = round(len(alltweets) * 0.0017)
hdbscan_model = HDBSCAN(min_cluster_size=min_clusters, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=5)

sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(alltweets)

#run the model
topic_model = BERTopic(nr_topics='auto', umap_model=umap_model, hdbscan_model=hdbscan_model, embedding_model=sentence_model, vectorizer_model=vectorizer_model, low_memory=True, calculate_probabilities=True)

topics, probs = topic_model.fit_transform(alltweets, embeddings)

When I run the code above several times, I consistently get the exact same output. Do you perform any preprocessing steps in which there might be some randomness?
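
As a quick diagnostic (just a sketch, reusing sentence_model and alltweets from the snippet above), you could also check whether the embeddings themselves are identical between runs; if they are not, the randomness sits upstream of UMAP and HDBSCAN:

import numpy as np

# If this prints False, the variation originates in preprocessing or encoding,
# not in the UMAP/HDBSCAN stages
emb_run1 = sentence_model.encode(alltweets)
emb_run2 = sentence_model.encode(alltweets)
print(np.allclose(emb_run1, emb_run2))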

ViktoriaSpaiser commented 2 years ago

Many thanks for your response Maarten, this is really strange, I wonder whether it's to do with the setup of my Python environment. And no, we don't do any preprocessing except for what is shown in "preprocess and clean data" step. We'll try to understand what is going on and report back for everyone.

ViktoriaSpaiser commented 2 years ago

Hello, just a quick update. It seems we got different outcomes when running the analysis on different machines, and we are still not sure why this happened. But we are now able to reproduce the most recent result consistently. Many thanks again for your help, Maarten.

MaartenGr commented 2 years ago

Thanks for the update! Good to hear that the results can now be reproduced consistently. If you ever run into any other issues, please let me know and I'll be glad to help out.

dimitry12 commented 2 years ago

@ViktoriaSpaiser I am experimenting with the variability of the results and am curious to learn if you found the root cause of variability which you observed.

ViktoriaSpaiser commented 2 years ago

Hi @dimitry12 No, unfortunately we never found the root cause of the variability. Given that it occurred only once, when we ran the analysis on two different machines, it may have something to do with slightly different configurations of the environment. We did not encounter the problem again. Do you have an idea what the root cause is; did you encounter the problem?

SaraAmd commented 1 year ago

Can you tell me how you came up with the formula for min_clusters?

drob-xx commented 1 year ago

@SaraAmd If you are trying to tune min_clusters, you might try my TopicTuner. In my experience there is no calculation you can do to get the "right" value of min_clusters. As you probably know, BERTopic takes min_clusters and uses that value for HDBSCAN's min_cluster_size; in turn, HDBSCAN, if not passed a different value for min_samples, will set min_samples to that same value. The relationship between a given reduction, min_samples, and min_cluster_size is not linear, so it can be difficult or impossible to predict their precise effect without running many tests. Testing multiple parameter combinations through BERTopic involves quite a bit of overhead and in many cases will make a thorough parameter search impractical. TopicTuner is meant to make the process of determining the best min_samples and min_cluster_size as efficient and painless as possible. More on HDBSCAN tuning (which is what min_clusters is doing) here.
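
To illustrate the kind of search involved, here is a rough sketch (it assumes precomputed document embeddings in an embeddings array, and the grid values are arbitrary) of reducing once with UMAP and then sweeping HDBSCAN parameters directly, outside of BERTopic:

from umap import UMAP
from hdbscan import HDBSCAN

# Reduce once, then try several HDBSCAN settings on the same reduced embeddings
reduced = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

for min_cluster_size in (25, 50, 100):         # arbitrary grid for illustration
    for min_samples in (5, 10, None):          # None lets HDBSCAN default to min_cluster_size
        labels = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples,
                         metric='euclidean', cluster_selection_method='eom').fit_predict(reduced)
        print(min_cluster_size, min_samples, labels.max() + 1, (labels == -1).sum())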

ashutoshraj commented 1 year ago

I tried running it on a Mac (M1 chip), but it gave different results (topics) when running it multiple times. After transferring the same code to Windows (Intel chip), I get consistent results.

KethavathSaiYashwanth commented 1 year ago

Hey, I am working on the BERTopic model but am unsure how to initialize each of its parameters. How do I decide the values of umap_model, vectorizer_model, hdbscan_model, and embedding_model? I am working on a dataset of around 18k user reviews.

MaartenGr commented 1 year ago

@KethavathSaiYashwanth I would advise going through some of the tutorials in the documentation or the ones you will find at the top of the README for more information.

KethavathSaiYashwanth commented 1 year ago

Thanks for the reply, I will go through them. But if I am not required to fine-tune any parameters and just using the default BERTopic model is fine, how do I reproduce my results? I tried the following code:

random_seed_1 = 42
umap_model = umap.UMAP(random_state=random_seed_1)
model = BERTopic(umap_model=umap_model)

But still, I was getting different results every time I ran the model. I do pre-processing, but that produces the same data every time before fitting the model.

MaartenGr commented 1 year ago

@KethavathSaiYashwanth Could you share your full code? It is difficult to say what is happening without seeing everything. Also, it might be worthwhile to use the parameters as listed here for better performance.

KethavathSaiYashwanth commented 1 year ago

Sorry, but I cannot share the whole code; I will share the part where the problem lies:

from bertopic import BERTopic
import umap

random_seed_1 = 42
random_seed_2 = 100

umap_model = umap.UMAP(random_state=random_seed_1)
model = BERTopic(umap_model=umap_model)

docs_i_train = negative_rows_i_train['Installer Verbatim Lemmatized'].tolist()
topics, probs = model.fit_transform(docs_i_train)

negative_rows_i_train['topic_Installer'] = topics
negative_rows_i_train['probability_Installer'] = probs
negative_rows_i_train

model.get_topic_freq()
model.get_topic_info()

The negative_rows_i_train is a pandas DataFrame with 14,993 rows of user reviews. Through model.get_topic_info(), it generated topics from -1 to 199. The next time I ran it, negative_rows_i_train was the same, but the model gave a different number of topics and different topic names.

MaartenGr commented 1 year ago

@KethavathSaiYashwanth Based on what you shared, this should be reproducible assuming you run it in the exact same environment with the same BERTopic version. Which version of BERTopic are you using? Also, could you try the following and see whether the results stay the same:

from bertopic import BERTopic
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)

KethavathSaiYashwanth commented 1 year ago

Thank you. My BERTopic model is reproducing results now, and the version I have is 0.15.0. I will explore the parameters further and try to improve my model.

KethavathSaiYashwanth commented 1 year ago

@MaartenGr Can you help me with another query? It is regarding guided topic modeling. I have a set of around 15 labels and I want to assign each text in my data (10k+ texts) to one of those labels. How should I use the BERTopic model?

MaartenGr commented 1 year ago

@KethavathSaiYashwanth Generally, you can use guided, manual, or semi-supervised BERTopic for that. It depends on whether you have labeled data or only the labels themselves. If you want to assign each text one of those specific labels, then using cosine similarity between the embeddings of the labels and texts to assign each label to a text is worth trying out. The resulting labels can then be used in manual BERTopic.
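
A possible sketch of that similarity-based assignment (the label names and the docs variable are hypothetical; the final step follows the manual-BERTopic pattern from the documentation):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

label_names = ["billing", "installation", "performance"]   # hypothetical labels
sentence_model = SentenceTransformer("all-mpnet-base-v2")

doc_embeddings = sentence_model.encode(docs)
label_embeddings = sentence_model.encode(label_names)

# Assign each document the label whose embedding it is most similar to
y = np.argmax(cosine_similarity(doc_embeddings, label_embeddings), axis=1)

# y can then be used as the pre-computed labels, e.g. topic_model.fit_transform(docs, y=y)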

KethavathSaiYashwanth commented 1 year ago

I tried guided topic modeling and it is providing good results. I have a large dataset (around 17k texts) and therefore a lot of topics are generated. I mainly need to fine-tune min_cluster_size and decide on its optimal value. Also, I need to fine-tune some other UMAP and HDBSCAN parameters and find their optimal values for my dataset. What metric should I use?

I am trying the coherence score, but there seems to be a problem with the gensim.models import (it says TypeError: <lambda>() got an unexpected keyword argument 'do_setlocale'). So, can you recommend any other way to compute the coherence score?

Also, while visualizing documents, the clusters seem to be very close together in the plot, and I think it is because my dataset is very large and many texts share similar words. If I want to make the clusters more distant from each other, which parameter should I play with?

MaartenGr commented 1 year ago

@KethavathSaiYashwanth

Also, I need to fine-tune some other UMAP and HDBSCAN parameters and find their optimal values for my dataset. What metric should I use?

That highly depends on your use case and what you want to optimize for. There is no fixed method that works best.

So, can you recommend any other way to compute the coherence score?

It might be worthwhile to post that issue on their repository to see if you can find some help to calculate the score. Other than that, OCTIS is a nice package for evaluation.
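
For what it is worth, a commonly used sketch for computing coherence on BERTopic output with gensim looks roughly like this (docs and topic_model are assumed to exist; this is not an official BERTopic API):

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Tokenize with the same analyzer the topic model's vectorizer used
analyzer = topic_model.vectorizer_model.build_analyzer()
tokens = [analyzer(doc) for doc in docs]
dictionary = Dictionary(tokens)

# Top words per topic, skipping the -1 outlier topic
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in topic_model.get_topics() if topic != -1]

coherence = CoherenceModel(topics=topic_words, texts=tokens, dictionary=dictionary,
                           coherence='c_v').get_coherence()
print(coherence)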

Also, while visualizing documents, the clusters seem to be very close together in the plot, and I think it is because my dataset is very large and many texts share similar words. If I want to make the clusters more distant from each other, which parameter should I play with?

t-SNE tends to make the cluster separation bigger, so using that for visualization might be worthwhile.
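
If you want to try that, a small sketch (assuming the precomputed embeddings and the list of docs are available) is to reduce to two dimensions with t-SNE yourself and pass the result to the documents plot:

from sklearn.manifold import TSNE

# Reduce the document embeddings to 2D with t-SNE for plotting only
reduced_embeddings = TSNE(n_components=2, metric='cosine', random_state=42).fit_transform(embeddings)

# BERTopic accepts externally reduced embeddings for visualize_documents
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)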

KethavathSaiYashwanth commented 1 year ago

I resolved the issue with gensim, and the coherence score calculation is working. But gensim's CoherenceModel needs the topic tokens to be regenerated every time a parameter is changed, so the model has to be fit and transformed for every parameter change, which takes a lot of time. Is there a better way of fine-tuning the BERTopic model?

MaartenGr commented 1 year ago

@KethavathSaiYashwanth You could pre-calculate the embeddings beforehand and pass those. That should speed up fitting quite a bit. You could also do the same with the dimensionality reduced embeddings if you will keep the parameters the same.
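
A sketch of that idea (the model name and the parameter grid are illustrative assumptions): encode the documents once, then reuse the same embeddings for every trial fit:

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Encode once; this is the expensive step
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

for min_cluster_size in (50, 100, 200):       # illustrative values
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric='euclidean',
                            cluster_selection_method='eom', prediction_data=True)
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(docs, embeddings)
    print(min_cluster_size, len(topic_model.get_topic_info()) - 1)   # rough topic count, excluding -1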

faiztest commented 1 year ago

I have the same problem. When the number of topics is 5 or 6, I can easily reproduce the same result, but when I set it to 7 or 8, the result is different after a re-run. Do you have any solution?

P.S. I run it on Streamlit; when I run it on Google Colab, it works fine!

MaartenGr commented 1 year ago

@faiztest Have you read the FAQ on this issue? Note that it is about re-running on the exact same environment. If you switch environments, for example with a different OS, then results might differ. Other than that it is difficult to say without seeing the actual code.

henrique-back commented 11 months ago

@MaartenGr, I cannot reproduce my results in a different environment, even though the same version of BERTopic is installed. Do you have any idea why different environments produce different results and how this can be avoided?

MaartenGr commented 11 months ago

@henrique-back If the only thing you freeze is the BERTopic version, then the environments might differ in their sub-dependencies. You would need to fully version-control the environment to make sure everything stays the same. If you use a different OS, that might be the culprit, as I believe that is a known issue with UMAP. If you search the issue tracker, I think there are a couple of issues on the subject.
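
One quick way to compare the relevant sub-dependencies across two environments (a diagnostic sketch, nothing more) is to print their versions on both machines:

import bertopic, umap, hdbscan, sklearn, numpy

for module in (bertopic, umap, hdbscan, sklearn, numpy):
    # Any mismatch here is a candidate explanation for differing results
    print(module.__name__, module.__version__)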

abdullahfurquan commented 8 months ago

Hi,

I am also facing a similar issue. If I train BERTopic on the same dataset multiple times, I get a different number of topics.

As per the discussion above in this thread, I have tried the two approaches below, but neither resolved the issue. In both cases, running the code multiple times gives a different number of topics:

I am running the program on an Amazon SageMaker notebook instance.

docs: this is my list of documents used in training.

(1)

from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from bertopic import BERTopic

vectorizer_model = CountVectorizer(ngram_range=(2, 3), stop_words='english')
umap_model = UMAP(random_state=42)
topic_model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model)
topics, probabilities = topic_model.fit_transform(docs)

(2)

from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from bertopic import BERTopic

vectorizer_model = CountVectorizer(ngram_range=(2, 3), stop_words='english')
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model)
topics, probabilities = topic_model.fit_transform(docs)

Thank you!

DanielGrothKU commented 8 months ago

I have the exact same code as you, @abdullahfurquan, and I have encountered the same issue. I am using Google Colab and their GPUs, but in the past when I used this code it was actually reproducible. Hope this gets figured out.

BR,

Daniel

MaartenGr commented 8 months ago

@DanielGrothKU It might be worthwhile to try installing UMAP through a specific commit as mentioned in the referenced issue here. Let me know how it works out!