MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

Two curious questions #275

Closed fgergvdsvgsdh closed 2 years ago

fgergvdsvgsdh commented 3 years ago
  1. I want to know why I get different results (topics, etc.) each time I run BERTopic. I am also interested in the theoretical point of view. I guess it has something to do with random processes, but I don't have a clear picture.
  2. Once I save the topic model, what can I extract from the saved model? Can I extract everything and continue from the loading point as if I had just run the model?
MaartenGr commented 3 years ago
  1. UMAP is stochastic by nature, which means that every time you run it, you get different results. You can find a bit more about that here.
  2. You can continue from the loading point as if you had just run the model. However, you cannot continue training it after you load it; that feature is not implemented in BERTopic.
cb-pratibhasaha commented 2 years ago

I was curious to know if there is a way to make sure that every time I run BERTopic, I get the same topics/results? Also, if it produces different results each time it is run, with the number of topics differing each time, how do we know it is giving accurate results?

Thanks in advance.

MaartenGr commented 2 years ago

@cb-pratibhasaha

I was curious to know if there is a way to make sure that every time I run BERTopic, I get the same topics/results?

The underlying dimensionality reduction algorithm, UMAP, is stochastic, which leads to different results each run. You can find more about that, and about how to get the same results each time, here.

Also, if it produces different results each time it is run, with the number of topics differing each time, how do we know it is giving accurate results?

That depends on your definition of "accurate". Seeing as topic modeling can be quite subjective, it really depends on your use case, your evaluation metrics, the stakeholders involved, etc. From an algorithmic perspective, you can still set a random_state in UMAP, which is similar to what you can do in many other algorithms.

cb-pratibhasaha commented 2 years ago

Thank you for the response @MaartenGr. I used the following code and set the random state to a specific number, yet it gives a different result every time I run it:

from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP
from hdbscan import HDBSCAN

cluster_model = KMeans(n_clusters=150)
umap_model = UMAP(random_state=42)
# vectorizer_model_cb and text are defined elsewhere
topic_model = BERTopic(hdbscan_model=cluster_model, vectorizer_model=vectorizer_model_cb, umap_model=umap_model)
topics, _ = topic_model.fit_transform(text)

Do let me know your thoughts. Again, thank you for the response.

cb-pratibhasaha commented 2 years ago

As an extension, regarding the choice of clustering algorithm: HDBSCAN gives too many outliers, while K-Means pushes outliers into categories they do not belong to. I wanted to ask if there is a clustering algorithm that is a middle ground between these two, such that the proportion of outliers is reduced without compromising accuracy?

MaartenGr commented 2 years ago

Thank you for the response @MaartenGr. I used the following code and set the random state to a specific number, yet it gives a different result every time I run it:

This is because KMeans also has a random_state parameter, which you should set to make sure every run is the same.
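A small sketch of that point, assuming scikit-learn and NumPy are installed: seeding KMeans makes repeated fits on the same data return the same cluster labels. The synthetic data and the `cluster` helper are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for reduced embeddings (assumption: 300 points, 5 dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))


def cluster(seed):
    """Hypothetical helper: fit a seeded KMeans and return the cluster labels."""
    model = KMeans(n_clusters=10, n_init=10, random_state=seed)
    return model.fit_predict(X)


labels_a = cluster(42)
labels_b = cluster(42)

# Same seed and same data: the label assignments should be identical.
same_labels = bool(np.array_equal(labels_a, labels_b))
print("identical labels on both runs:", same_labels)
```

In the code from the question, this would mean passing random_state to KMeans as well as to UMAP, since both steps draw random numbers.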

As an extension, regarding the choice of clustering algorithm: HDBSCAN gives too many outliers, while K-Means pushes outliers into categories they do not belong to. I wanted to ask if there is a clustering algorithm that is a middle ground between these two, such that the proportion of outliers is reduced without compromising accuracy?

I would advise playing around with some of the HDBSCAN parameters. Lowering min_samples and min_cluster_size in particular can result in fewer outliers. You can find more about that here. That way, you have some control over what proportion of outliers suits your use case.

Kuniko925 commented 1 year ago

The underlying dimensionality reduction algorithm, UMAP, is stochastic, which leads to different results each run. You can find more about that, and about how to get the same results each time, here.

I am setting the random state for my model by following the information above, but I got different results when it was executed. So far, I have seen two different results across more than ten runs. Is there any way to get the same results every time?

The structure is as follows.

Thank you very much in advance.

MaartenGr commented 1 year ago

@Kuniko925 Could you share your entire code or create a reproducible example? Without it, it is quite difficult to understand what is happening in your specific environment. Also, could you share the version of BERTopic and its dependencies?

Kuniko925 commented 1 year ago

@MaartenGr Thank you very much for your reply. I am sorry that I did not share my code. Version of bertopic is 0.14.1.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.cluster import KMeans

import nltk
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("english")
nltk.download("stopwords")

"""
Reference URL: https://github.com/MaartenGr/BERTopic/issues/286
"""
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
  def __init__(self):
    self.wnl = WordNetLemmatizer()
  def __call__(self, doc):
    return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

def modeling_bertopic():

  n_neighbors = 30
  n_components = 100
  min_cluster_size = 25
  top_n_words = 30
  min_samples = 1
  ngram_range = (1, 3)

  sentence_model = SentenceTransformer("all-MiniLM-L12-v2")
  umap_model = UMAP(n_neighbors=n_neighbors, n_components=n_components, min_dist=0.0, metric="cosine", random_state=42)
  hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean", cluster_selection_method="eom", prediction_data=True, min_samples=min_samples)
  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
  vectorizer_model = CountVectorizer(ngram_range=ngram_range, max_df=0.70, tokenizer=LemmaTokenizer(), stop_words=stopwords.words("english"))

  model = BERTopic(nr_topics = "auto", language="english", top_n_words=top_n_words,
                       embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model, 
                       ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model,
                      calculate_probabilities=True)

  return model

topic_model = modeling_bertopic()
topics, probs = topic_model.fit_transform(abstract)

The results differed between sessions. When I ran the code twice within the same session, my model produced the same topics. When I ran the code, disconnected the session, and ran it again, my model produced different topics from those before the disconnection.

Thank you very much again.

MaartenGr commented 1 year ago

@Kuniko925 Could you also try it with k-Means instead of HDBSCAN? It might be that something is going on there. It might also be worthwhile to try not setting nr_topics at all. Other than that, I believe it should create reproducible results. Removing one parameter per run to see where the main issue lies might help.
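One way to compare runs across sessions while removing parameters one at a time is to record a short fingerprint of the topic assignments after each run. The sketch below uses only the Python standard library. `topic_fingerprint` is a hypothetical helper, not part of BERTopic, and the example lists are made-up topic ids standing in for the `topics` returned by fit_transform.

```python
import hashlib
import json


def topic_fingerprint(topics):
    """Hypothetical helper: short, stable hash of a list of topic ids."""
    payload = json.dumps([int(t) for t in topics]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# Made-up topic assignments from three runs (-1 marks an outlier).
run_a = [0, 1, -1, 2, 0]
run_b = [0, 1, -1, 2, 0]
run_c = [0, 1, 1, 2, 0]

print(topic_fingerprint(run_a) == topic_fingerprint(run_b))  # identical assignments
print(topic_fingerprint(run_a) == topic_fingerprint(run_c))  # one document moved
```

Writing the fingerprint down next to the configuration used for each session makes it easy to see which parameter change coincides with the results becoming stable.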

NOTE: I edited your message to make sure the code is displayed in a readable format.

Kuniko925 commented 1 year ago

@MaartenGr Thank you for getting back to me and updating the message. I tried running with k-Means instead of HDBSCAN, but the topics still changed. I removed nr_topics as well, but the topics still changed...

Kuniko925 commented 1 year ago

@MaartenGr Sorry to bother you repeatedly. I tried removing a parameter of UMAP, n_components. Then so far, the same topics have been reproduced. Thank you very much for your support.

MaartenGr commented 1 year ago

@Kuniko925 No problem at all! Glad to hear that you fixed the issue.

alicjamalota commented 1 year ago

@MaartenGr Sorry to bother you repeatedly. I tried removing a parameter of UMAP, n_components. Then so far, the same topics have been reproduced. Thank you very much for your support.

Hey @Kuniko925 ! What did you mean by 'removing' the n_components parameter? Thanks!