Open Kuniko925 opened 1 year ago
If you start from a base BERTopic model and do not change environments, then setting random_state in UMAP alone should be sufficient to make it fully reproducible. That means that whatever you add on top of that, whether a cluster model, a tokenizer, or something else, also needs to be reproducible. It might therefore be worthwhile to run the pipeline as follows to see whether the base model is indeed reproducible:
model = BERTopic(embedding_model=sentence_model, umap_model=umap_model)
If it is indeed reproducible, then you will know that one of the other parameters might be relevant here. Then, you can test out which parameters might result in the stochastic behavior that you noticed.
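One way to make this check concrete is to fit the same configuration several times and compare the topic assignments. The sketch below is only an illustration: is_reproducible and fit_once are hypothetical names, not part of BERTopic, and the commented usage assumes that docs and sentence_model are already defined on your side.

```python
from typing import Callable, Sequence


def is_reproducible(fit_once: Callable[[], Sequence[int]], n_runs: int = 3) -> bool:
    """Return True if every call to fit_once yields identical topic labels."""
    baseline = list(fit_once())
    for _ in range(n_runs - 1):
        if list(fit_once()) != baseline:
            return False
    return True


# Hypothetical usage with BERTopic (assumes `docs` and `sentence_model` exist):
#
# from bertopic import BERTopic
# from umap import UMAP
#
# def fit_once():
#     # Build fresh objects on every call so no fitted state is reused.
#     umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
#                       metric="cosine", random_state=42)
#     model = BERTopic(embedding_model=sentence_model, umap_model=umap_model)
#     topics, _ = model.fit_transform(docs)
#     return topics
#
# print(is_reproducible(fit_once))
```

Comparing the per-document topic labels is a fairly strict test. If it passes for the base model, any instability seen later most likely comes from the components added on top.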
@MaartenGr Thank you for your reply. I will try.
Hey @MaartenGr !
I am facing the same problem as discussed above. I get different results with each run of BERTopic even though I set random_state=42 for UMAP and I do not use any HDBSCAN (it is by default set to None). I really want to find a solution for that problem. I would appreciate your help on that. Thanks!
main_representation = KeyBERTInspired()
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]
aspect_model3 = PartOfSpeech()

representation_model = {
    "Main": main_representation,
    "Representation2": aspect_model2,
    "Representation3": aspect_model3,
}

topic_model = BERTopic(
    verbose=True,
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42),
    seed_topic_list=seed_topic_list,
    vectorizer_model=CountVectorizer(stop_words=stop_words, max_df=1.0, ngram_range=(1, 2)),
    representation_model=representation_model,
).fit(df.preprocessed_text, y=df.ssot_topic_id)
@alicjamalota I am not exactly sure what the reason for this is, but it might be worthwhile to remove parameters one at a time (keeping random_state in UMAP) until the behavior becomes stable. That way, you can experiment yourself and see where the issue lies. It might be a result of seed_topic_list or even the y parameter, but I cannot be sure without some experimentation on your side.
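The removal procedure described above could be automated roughly as follows. This is a minimal sketch under stated assumptions: remove_until_stable and fit_with are hypothetical names, and fit_with stands for a wrapper you would write yourself that builds a fresh BERTopic model from the given settings, fits it, and returns the topic labels.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def _stable(fit_with: Callable[[Dict[str, Any]], Sequence[int]],
            settings: Dict[str, Any], n_runs: int) -> bool:
    """Fit n_runs times with the same settings and compare the topic labels."""
    baseline = list(fit_with(dict(settings)))
    return all(list(fit_with(dict(settings))) == baseline
               for _ in range(n_runs - 1))


def remove_until_stable(
    fit_with: Callable[[Dict[str, Any]], Sequence[int]],
    settings: Dict[str, Any],
    removable: List[str],
    n_runs: int = 3,
) -> Tuple[bool, List[str]]:
    """Drop settings one by one, in order, until repeated fits agree.

    Returns (stable, removed). When stable is True, the last entry of
    `removed` (if any) is the most likely source of the randomness.
    """
    remaining = dict(settings)
    removed: List[str] = []
    if _stable(fit_with, remaining, n_runs):
        return True, removed
    for name in removable:
        remaining.pop(name, None)
        removed.append(name)
        if _stable(fit_with, remaining, n_runs):
            return True, removed
    return False, removed
```

For the configuration in this thread, removable might be something like ["representation_model", "seed_topic_list", "y"], with the UMAP random_state left in place throughout. Note that agreement across a handful of runs does not prove determinism, so a larger n_runs gives more confidence.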
@MaartenGr
I have run into the behavior I asked about in the ticket linked below again. That ticket was already closed, so I opened a new one here.
https://github.com/MaartenGr/BERTopic/issues/275#issuecomment-1505170008