MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

reduce_outliers and update_topics remove stop_words and ngram_range effects #2114

Open shj37 opened 3 months ago

shj37 commented 3 months ago

Describe the bug

After running reduce_outliers and update_topics, the effects of all settings passed to vectorizer_model (stop words, ngram_range) are gone: the topic representations only contain single words. Thanks.

vectorizer_model = CountVectorizer(stop_words=stop_words, ngram_range=(1, 4), min_df=5)
representation_model = MaximalMarginalRelevance(diversity=0.5)

topic_model_outlier_reduction = BERTopic(
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=15,
    min_topic_size=15,
    calculate_probabilities=True
)
topics_outlier_reduction, probs_outlier_reduction = topic_model_outlier_reduction.fit_transform(docs, embeddings)

new_topics = topic_model_outlier_reduction.reduce_outliers(
    docs,
    topics_outlier_reduction,
    threshold=0.2,
    strategy="distributions",  # probabilities=probs_outlier_reduction,
)

topic_model_outlier_reduction.update_topics(docs, topics=new_topics)

BERTopic Version

0.16.0

MaartenGr commented 3 months ago

That's expected behavior: .update_topics recalculates the topic representations with its default models if you do not pass your own. So instead of this:

topic_model_outlier_reduction.update_topics(docs, topics=new_topics)

you should do this:

topic_model_outlier_reduction.update_topics(
    docs,
    topics=new_topics,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model 
)