MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic - Topic reduction produced 18 topics when nr_topics=96 #2099

Open kzaho-m opened 2 months ago

kzaho-m commented 2 months ago


Describe the bug

I am stuck on this bug: I set nr_topics=96 (with min_cluster_size=60, min_samples=30), but the log says "BERTopic - Topic reduction - Reduced number of topics from 18 to 18". Shouldn't it say something like "BERTopic - Topic reduction - Reduced number of topics from 96 to 18"? Do you have any ideas on how to get rid of this "Topic reduction" step?

Reproduction

Train model

representation_model = {
    "Main": KeyBERTInspired(),
    "POS": [
        PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns),
        MaximalMarginalRelevance(diversity=.4)
    ],
}

# The vectorizer runs after embeddings are generated and only affects the
# topics' word representations, so we apply stop words and the lemmatizer here
vectorizer_model = CountVectorizer(
    min_df=MIN_DF,
    max_df=MAX_DF,
    ngram_range=NGRAM_RANGE,
    stop_words='english',
    tokenizer=LemmaTokenizer(),
)

umap_model = UMAP(
    n_neighbors=15,
    n_components=50,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # UMAP is a stochastic algorithm, so set a seed to make results reproducible
)

hdbscan_model = HDBSCAN(
    min_cluster_size=60, min_samples=30, # Use values from TMT
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

topic_model = BERTopic(
    nr_topics=96, # Use value from TMT
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
)

topics, ini_probs = topic_model.fit_transform(splited_data)
2024-07-26 17:20:18,830 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100% 1563/1563 [00:13<00:00, 141.83it/s]
2024-07-26 17:20:33,447 - BERTopic - Embedding - Completed ✓
2024-07-26 17:20:33,448 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-07-26 17:20:35,428 - BERTopic - Dimensionality - Completed ✓
2024-07-26 17:20:35,441 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-07-26 17:20:37,022 - BERTopic - Cluster - Completed ✓
2024-07-26 17:20:37,023 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-07-26 17:21:25,257 - BERTopic - Representation - Completed ✓
2024-07-26 17:21:25,258 - BERTopic - Topic reduction - Reducing number of topics
2024-07-26 17:21:25,261 - BERTopic - Topic reduction - Reduced number of topics from 18 to 18

BERTopic Version

0.16.3

MaartenGr commented 2 months ago

What happens here is that the cluster model actually only found 18 clusters. You then ask BERTopic to reduce them to 96 topics, which is not possible: reduction can only merge clusters, never split them. It only takes effect when the initial number of clusters is higher than the value you set for nr_topics.
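To make the behavior concrete, here is a minimal plain-Python sketch (not BERTopic's actual internals) of why the final count is capped at the number of clusters the model found:

```python
def topics_after_reduction(found_topics: int, nr_topics: int) -> int:
    # Topic reduction merges existing clusters; it can never split them,
    # so the result is capped at the number of clusters HDBSCAN found.
    return min(found_topics, nr_topics)

print(topics_after_reduction(18, 96))   # the situation in this issue: stays at 18
print(topics_after_reduction(120, 96))  # here reduction actually applies: 120 -> 96
```

If you want to skip the reduction step entirely, leave nr_topics at its default of None. If you want more initial topics, lower min_cluster_size (and/or min_samples) so that HDBSCAN produces more clusters in the first place.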