MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.14k stars, 766 forks

Intertopic Distance Map keeps changing every time I rerun it #1822

Open Yanith1 opened 8 months ago

Yanith1 commented 8 months ago

Hello there,

So I am pretty new to this, but I am really interested in using BERTopic to explore my corpus. I am not sure if this is inherent in the code itself, but whenever I rerun it, I get a different intertopic distance map. This means I cannot reproduce the result, which is not ideal. I have attached the code I used below. Thank you!

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# df is my DataFrame; drop rows without policy text
df_clean = df.dropna(subset=['Policy_Content'])
umap = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=123)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

topic_model = BERTopic(umap_model=umap, vectorizer_model=vectorizer_model, verbose=True)
topics, probs = topic_model.fit_transform(df_clean['Policy_Content'])

# 227 topics in total

topic_model.reduce_topics(df_clean['Policy_Content'], nr_topics=48)

topic_model.visualize_topics()
```

Warm regards, Yanith

MaartenGr commented 8 months ago

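That is correct. The `visualize_topics` method reduces the topic embeddings to 2-dimensional space with its own UMAP step, and that step does not set a random state, so the layout changes on every run. The `random_state` you set on your own UMAP model only applies to the topic modeling itself. Setting a random state in the visualization would slow things down. You could create your own version by adapting the code here.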

Yanith1 commented 8 months ago

Thanks a lot for your help Maarten. I really appreciate this!