I was curious to know if there is a way to make sure that every time I run BERTopic, I get the same topics/results. Also, if it produces different results each time it is run, with the count of topics differing each time, how do we know it is giving accurate results?
Thanks in advance.
@cb-pratibhasaha
I was curious to know if there is a way to make sure that every time I run BERTopic, I get the same topics/results.
The underlying dimensionality reduction algorithm, UMAP, is stochastic, which means each run can produce different results. You can find more about that here, including how to get the same results each time.
Also, if it produces different results each time it is run, with the count of topics differing each time, how do we know it is giving accurate results?
That depends on your definition of "accurate". Seeing as topic modeling can be quite subjective, it really depends on your use case, your evaluation metrics, the stakeholders involved, etc. From an algorithmic perspective, you can still set a random_state in UMAP, similar to what you can do in many other algorithms.
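For reference, a minimal sketch of what that looks like in code; the parameter values and the docs variable are illustrative assumptions, not values from this thread:

from bertopic import BERTopic
from umap import UMAP

# Fixing UMAP's random_state makes the dimensionality reduction step deterministic,
# at the cost of disabling some of UMAP's parallelism.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)  # docs: your list of documents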
Thank you for the response @MaartenGr. I used the following code and set the random state to a specific number, yet it gives a different result every time I run it:
from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP
from hdbscan import HDBSCAN

cluster_model = KMeans(n_clusters=150)
umap_model = UMAP(random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model, vectorizer_model=vectorizer_model_cb, umap_model=umap_model)
topics, _ = topic_model.fit_transform(text)
Do let me know your thoughts. Again, thank you for the response.
As an extension, regarding the choice of clustering algorithm: HDBSCAN gives too many outliers, while K-Means pushes outliers into categories they do not belong to. I wanted to ask if there is a clustering algorithm that is a middle ground between these two, such that the proportion of outliers is reduced without compromising accuracy?
Thank you for the response @MaartenGr. I used the following code and set the random state to a specific number, yet it gives a different result every time I run it:
This is because KMeans also has a random_state parameter that you should set to make sure every run is the same.
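A minimal sketch of that fix, reusing the setup from the snippet above (the seed value 42 is just an example):

from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP

# Seed both stochastic components: UMAP for dimensionality reduction and KMeans for clustering.
cluster_model = KMeans(n_clusters=150, random_state=42)
umap_model = UMAP(random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model, umap_model=umap_model)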
As an extension, regarding the choice of clustering algorithm: HDBSCAN gives too many outliers, while K-Means pushes outliers into categories they do not belong to. I wanted to ask if there is a clustering algorithm that is a middle ground between these two, such that the proportion of outliers is reduced without compromising accuracy?
I would advise playing around with some of the HDBSCAN parameters. Especially lowering min_samples and min_cluster_size can result in fewer outliers. You can find more about that here. That way, you have some control over which portion of outliers suits your use case.
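As an illustration, a sketch with lowered values; the specific numbers are assumptions to tune against your own data rather than recommendations:

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Smaller min_cluster_size and min_samples generally leave fewer documents
# unassigned (i.e., fewer documents in the -1 outlier topic).
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=1, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)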
The underlying dimensionality reduction algorithm, UMAP, is stochastic, which means each run can produce different results. You can find more about that here, including how to get the same results each time.
I am setting the random state for my model following the information above, but I still get different results when it is executed. So far, I have seen two different results across more than ten runs. Is there any way to get the same results every time?
The structure is as follows.
Thank you very much in advance.
@Kuniko925 Could you share your entire code or create a reproducible example? Without it, it is quite difficult to understand what is happening in your specific environment. Also, could you share the version of BERTopic and its dependencies?
@MaartenGr Thank you very much for your reply. I am sorry that I did not share my code. The version of BERTopic is 0.14.1.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("english")
nltk.download("stopwords")
"""
Reference URL: https://github.com/MaartenGr/BERTopic/issues/286
"""
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

def modeling_bertopic():
    n_neighbors = 30
    n_components = 100
    min_cluster_size = 25
    top_n_words = 30
    min_samples = 1
    ngram_range = (1, 3)
    sentence_model = SentenceTransformer("all-MiniLM-L12-v2")
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components, min_dist=0.0, metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean", cluster_selection_method="eom", prediction_data=True, min_samples=min_samples)
    ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
    vectorizer_model = CountVectorizer(ngram_range=ngram_range, max_df=0.70, tokenizer=LemmaTokenizer(), stop_words=stopwords.words("english"))
    model = BERTopic(nr_topics="auto", language="english", top_n_words=top_n_words,
                     embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model,
                     ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model,
                     calculate_probabilities=True)
    return model
topic_model = modeling_bertopic()
topics, probs = topic_model.fit_transform(abstract)
Different results were generated when the session changed. If I ran the code and then ran it again within the same session, the model produced the same topics. If I ran the code, disconnected the session, and ran it again, the model produced different topics from before the disconnection.
Thank you very much again.
@Kuniko925 Could you also try it with k-Means instead of HDBSCAN? It might be that there is something going on there. Also, perhaps not setting nr_topics at all might be worthwhile to try out. Other than that, I believe it should create reproducible results. Iteratively removing one parameter each run to see where the main issue lies might help; see the sketch below.
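For instance, a sketch of that isolation step, assuming the configuration from modeling_bertopic() above; the k-Means cluster count is an arbitrary example value:

from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP

# Keep UMAP seeded as before, swap HDBSCAN for k-Means (deterministic once seeded),
# and leave nr_topics unset to rule out the topic reduction step.
umap_model = UMAP(n_neighbors=30, n_components=100, min_dist=0.0, metric="cosine", random_state=42)
cluster_model = KMeans(n_clusters=50, random_state=42)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model)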
NOTE: I edited your message to make sure the code is displayed in a readable format.
@MaartenGr Thank you for getting back to me and updating the message. I tried running with k-Means instead of HDBSCAN; however, the topics still changed. I removed nr_topics as well, but the topics still changed...
@MaartenGr Sorry to bother you repeatedly. I tried removing the UMAP parameter n_components. So far, the same topics have been reproduced. Thank you very much for your support.
@Kuniko925 No problem at all! Glad to hear that you fixed the issue.
@MaartenGr Sorry to bother you repeatedly. I tried removing the UMAP parameter n_components. So far, the same topics have been reproduced. Thank you very much for your support.
Hey @Kuniko925 ! What did you mean by 'removing' the n_components parameter? Thanks!