MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Number of topics decreases significantly every run #1784

Open daianaccrisan opened 7 months ago

daianaccrisan commented 7 months ago

Hello Maarten!

These are the configs I am using for my model, run on a dataset of news articles. When running the model with the default min_cluster_size, I get 200+ topics. When I run it a second time, I get five topics (for 7,500 documents). I tried different values for min_cluster_size, and whatever number I give (30, 100), I get 3 topics.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

hdbscan_model = HDBSCAN(min_cluster_size=20, prediction_data=True)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    top_n_words=15,
    hdbscan_model=hdbscan_model,
    umap_model=UMAP(),
    vectorizer_model=vectorizer_model,
    language='english',
    verbose=True)
```

Could you please tell me if I am doing something wrong? I am running the code in Google Colab and have used BERTopic in this environment before, but my results have never differed so much from one run to another.

Best regards, Daiana

MaartenGr commented 7 months ago

You are not passing the UMAP model to BERTopic. Please make sure to follow the best practices guide here: https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html or the FAQ here: https://maartengr.github.io/BERTopic/faq.html#why-are-the-results-not-consistent-between-runs