MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic - Topic reduction produced 18 topics when nr_topics=96 #2099

Open kzaho-m opened 4 months ago

kzaho-m commented 4 months ago

Describe the bug

I am stuck on the following behavior: I set nr_topics=96 (with min_cluster_size=60, min_samples=30), but the log shows "BERTopic - Topic reduction - Reduced number of topics from 18 to 18". Shouldn't it read something like "BERTopic - Topic reduction - Reduced number of topics from 96 to 18"? Do you have any ideas on how to get rid of this "Topic reduction" step?

Reproduction

Train model

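from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# pos_patterns, MIN_DF, MAX_DF, NGRAM_RANGE, LemmaTokenizer and splited_data
# are defined elsewhere and are not shown in this report.

representation_model = {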
    "Main": KeyBERTInspired(),
    "POS": [
        PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns),
        MaximalMarginalRelevance(diversity=.4)
    ],
}

# The vectorizer runs after the embeddings are generated and only affects the topic word representations, so stop words and lemmatization are applied here
vectorizer_model = CountVectorizer(
    min_df=MIN_DF,
    max_df=MAX_DF,
    ngram_range=NGRAM_RANGE,
    stop_words='english',
    tokenizer=LemmaTokenizer(),
)

umap_model = UMAP(
    n_neighbors=15,
    n_components=50,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # UMAP is stochastic, so set a seed to make results reproducible
)

hdbscan_model = HDBSCAN(
    min_cluster_size=60, min_samples=30, # Use values from TMT
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

topic_model = BERTopic(
    nr_topics=96, # Use value from TMT
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
)

topics, ini_probs = topic_model.fit_transform(splited_data)
2024-07-26 17:20:18,830 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 1563/1563 [00:13<00:00, 141.83it/s]
2024-07-26 17:20:33,447 - BERTopic - Embedding - Completed ✓
2024-07-26 17:20:33,448 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-07-26 17:20:35,428 - BERTopic - Dimensionality - Completed ✓
2024-07-26 17:20:35,441 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-07-26 17:20:37,022 - BERTopic - Cluster - Completed ✓
2024-07-26 17:20:37,023 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-07-26 17:21:25,257 - BERTopic - Representation - Completed ✓
2024-07-26 17:21:25,258 - BERTopic - Topic reduction - Reducing number of topics
2024-07-26 17:21:25,261 - BERTopic - Topic reduction - Reduced number of topics from 18 to 18

BERTopic Version

0.16.3

MaartenGr commented 4 months ago

What happens here is that the cluster model only found 18 clusters. You then ask BERTopic to reduce those to 96 topics, which is not possible: 18 clusters cannot be reduced to 96. Topic reduction only takes effect when the initial number of clusters is higher than the value you set for nr_topics.
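
To make the rule above concrete, here is a minimal sketch in plain Python. It is not BERTopic's internal code: the helper name plan_topic_reduction and the toy labels are hypothetical, and the assumption is that the number of topics found is the number of distinct labels returned by fit_transform (counting the -1 outlier label if present).

```python
from typing import Iterable, Optional


def plan_topic_reduction(topics: Iterable[int], nr_topics: Optional[int]) -> str:
    """Describe what a requested nr_topics can do given the topics actually found.

    Hypothetical helper for illustration only; it mirrors the rule explained
    above rather than any BERTopic implementation detail.
    """
    # Distinct labels assigned to documents (assumption: -1, if present, counts too).
    n_found = len(set(topics))
    if nr_topics is None:
        return f"no reduction requested; keeping {n_found} topics"
    if nr_topics >= n_found:
        # Reduction can only merge topics, never create new ones.
        return f"cannot reduce {n_found} topics to {nr_topics}; keeping {n_found}"
    return f"reducing {n_found} topics to {nr_topics}"


# Toy labels standing in for the output of fit_transform: 18 distinct topics.
toy_topics = [i % 18 for i in range(1000)]
print(plan_topic_reduction(toy_topics, 96))
print(plan_topic_reduction(toy_topics, 10))
```

In practice this suggests fitting once with nr_topics left at its default of None, checking how many distinct topics come back, and only then choosing a smaller nr_topics. If more than 18 topics are wanted, the clustering step is the place to change, typically by lowering min_cluster_size and min_samples in HDBSCAN so that smaller clusters are allowed.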