MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6k stars, 752 forks

Cannot reproduce the same result #1264

Open Kuniko925 opened 1 year ago

Kuniko925 commented 1 year ago

@MaartenGr

I am running into the same problem that I asked about in the ticket below. That ticket has already been closed, so I have opened a new one here.

https://github.com/MaartenGr/BERTopic/issues/275#issuecomment-1505170008

I get different results every time the model is run. At one point the problem seemed to be resolved by the following changes:

- Delete nr_topics
- Delete n_components

However, when I ran it again today, my model could not reproduce the same results. I would like to know how to configure it so that it always gives the same result.

```
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.cluster import KMeans

import nltk
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("english")
nltk.download("stopwords")

"""
Reference URL:
https://github.com/MaartenGr/BERTopic/issues/286
"""
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

n_neighbors = 30
min_cluster_size = 25
top_n_words = 100
min_samples = 1
ngram_range = (1, 3)

sentence_model = SentenceTransformer("all-MiniLM-L12-v2")
umap_model = UMAP(n_neighbors=n_neighbors,
                  min_dist=0.0,
                  metric="cosine",
                  random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                        metric="euclidean",
                        cluster_selection_method="eom",
                        prediction_data=True,
                        min_samples=min_samples)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True,
                                     bm25_weighting=True)
vectorizer_model = CountVectorizer(ngram_range=ngram_range,
                                   max_df=0.70,
                                   tokenizer=LemmaTokenizer(),
                                   stop_words=stopwords.words("english"))

model = BERTopic(language="english",
                 top_n_words=top_n_words,
                 embedding_model=sentence_model,
                 umap_model=umap_model,
                 hdbscan_model=hdbscan_model,
                 ctfidf_model=ctfidf_model,
                 vectorizer_model=vectorizer_model,
                 calculate_probabilities=True)

topics, probs = model.fit_transform(abstract)
```

Thank you very much.
Kuniko

MaartenGr commented 1 year ago

If you start from a base BERTopic model and you do not change environments, then only setting random_state in UMAP should be sufficient to make it fully reproducible. That means that whatever you add on top of that, whether it be a cluster model, tokenizer, or something else also needs to be reproducible. As a result, it might be worthwhile to run the entire pipeline as follows to see if it is indeed reproducible:

model = BERTopic(embedding_model=sentence_model, umap_model=umap_model)

If it is indeed reproducible, then you will know that one of the other parameters might be relevant here. Then, you can test out which parameters might result in the stochastic behavior that you noticed.
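
One way to run such a check is to fit the same configuration several times and compare the topic assignments. The following is a minimal sketch, not part of BERTopic: `is_reproducible` is a hypothetical helper that accepts any zero-argument callable returning one topic id per document, and the two toy functions only stand in for a real `fit_transform` call.

```python
import random
from typing import Callable, List, Sequence


def is_reproducible(fit: Callable[[], Sequence[int]], runs: int = 3) -> bool:
    """Return True if every call to `fit` yields identical topic assignments.

    `fit` is a hypothetical zero-argument callable that builds a fresh model,
    fits it, and returns one topic id per document.
    """
    first = list(fit())
    return all(list(fit()) == first for _ in range(runs - 1))


# Toy stand-ins for a real pipeline (assumptions for illustration only).
def seeded_fit() -> List[int]:
    rng = random.Random(42)  # fixed seed, analogous to random_state=42 in UMAP
    return [rng.randrange(5) for _ in range(100)]


_unseeded_rng = random.Random()  # seeded from OS entropy, differs between calls


def unseeded_fit() -> List[int]:
    return [_unseeded_rng.randrange(5) for _ in range(100)]


print(is_reproducible(seeded_fit))
print(is_reproducible(unseeded_fit))
```

With a real model, the callable would presumably look like `lambda: BERTopic(embedding_model=sentence_model, umap_model=UMAP(random_state=42)).fit_transform(docs)[0]`, where `docs` is your own list of documents. Building the model and the UMAP instance inside the callable keeps each run independent of the previous one.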

Kuniko925 commented 1 year ago

@MaartenGr Thank you for your reply. I will try.

alicjamalota commented 1 year ago

Hey @MaartenGr !

I am facing the same problem as discussed above. I get different results on each run of BERTopic even though I set random_state=42 for UMAP and do not pass an HDBSCAN model (it is set to None by default). I would really like to find a solution and would appreciate your help. Thanks!

    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # seed_topic_list, stop_words and df are defined elsewhere
    main_representation = KeyBERTInspired()
    aspect_model2 = [KeyBERTInspired(top_n_words=30),
                     MaximalMarginalRelevance(diversity=.5)]
    aspect_model3 = PartOfSpeech()

    representation_model = {
        "Main": main_representation,
        "Representation2": aspect_model2,
        "Representation3": aspect_model3
    }

    topic_model = BERTopic(
        verbose=True,
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                        metric='cosine', random_state=42),
        seed_topic_list=seed_topic_list,
        vectorizer_model=CountVectorizer(stop_words=stop_words, max_df=1.0,
                                         ngram_range=(1, 2)),
        representation_model=representation_model
    ).fit(df.preprocessed_text, y=df.ssot_topic_id)

MaartenGr commented 1 year ago

@alicjamalota I am not exactly sure what the cause is here, but it might be worthwhile to remove parameters one at a time (keeping the random_state in UMAP) until the behavior becomes stable. That way, you can experiment yourself and see where the issue lies. It might be a result of seed_topic_list or even the y parameter, but I cannot be sure without some experimentation on your side.
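
The removal procedure described above can be sketched as a small loop. This is only an illustration under stated assumptions: `remove_until_stable` and `is_stable` are hypothetical helpers, not BERTopic functions. The option names (`option_a`, `option_b`, `option_c`) and `toy_fit` are placeholders in which `option_b` is made non-deterministic on purpose. They say nothing about which real parameter is responsible.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Fit = Callable[[Dict[str, Any]], Sequence[int]]


def is_stable(options: Dict[str, Any], fit: Fit, runs: int = 3) -> bool:
    """Fit `runs` times with the same options and compare topic assignments."""
    first = list(fit(options))
    return all(list(fit(options)) == first for _ in range(runs - 1))


def remove_until_stable(
    options: Dict[str, Any],
    removal_order: Sequence[str],
    fit: Fit,
    runs: int = 3,
) -> Tuple[List[str], bool]:
    """Drop options one at a time, in `removal_order`, until fits agree.

    Returns the names removed so far and whether the remaining configuration
    is stable. The last name removed is the most likely suspect.
    """
    current = dict(options)
    removed: List[str] = []
    for name in removal_order:
        if is_stable(current, fit, runs):
            return removed, True
        current.pop(name, None)
        removed.append(name)
    return removed, is_stable(current, fit, runs)


_entropy = random.Random()  # unseeded: a deliberate source of randomness


def toy_fit(options: Dict[str, Any]) -> List[int]:
    """Hypothetical stand-in for building and fitting a model from `options`."""
    rng = _entropy if "option_b" in options else random.Random(42)
    return [rng.randrange(5) for _ in range(100)]


removed, stable = remove_until_stable(
    {"option_a": 1, "option_b": 2, "option_c": 3},
    ["option_a", "option_b", "option_c"],
    toy_fit,
)
print(removed, stable)
```

In a real experiment, `fit` would construct `BERTopic(**options)` (or pass `y` to `.fit`) and return the topics, with the UMAP `random_state` kept in every configuration. Removing options one at a time costs several fits per step, so it may be worth precomputing embeddings and testing on a subset of the documents.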