Which hyperparameter most influences the number of topics for Chinese texts?

fishfree commented 4 months ago

import jieba

def tokenize_zh(text):
    # Segment Chinese text with jieba and keep only tokens longer than one character
    words = jieba.lcut(text)
    return [w for w in words if len(w) > 1]

import numpy as np
from umap import UMAP
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # a Chinese embedding model
# The variable stopwords is a list.
vectorizer_model = CountVectorizer(tokenizer=tokenize_zh, stop_words=stopwords, ngram_range=(1, 3), min_df=3)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(
    language='chinese', embedding_model=embedding_model, umap_model=umap_model,
    top_n_words=10, min_topic_size=10, n_gram_range=(1, 3),
    vectorizer_model=vectorizer_model, calculate_probabilities=True, verbose=True,
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

This produces only 3 topics, as shown below, far fewer than the Mallet tool finds.

I have tested almost every hyperparameter here and found that n_neighbors in the UMAP function has the most noticeable effect. However, even changing it from 15 to 50 adds only 1 new topic, as shown below:

It seems BERTopic needs more parameter tuning for Chinese, or for CJK texts in general. Can anyone share their experience, please?

MaartenGr commented 4 months ago

To get more topics, you would need to decrease the value of min_topic_size. The higher the minimum topic size, the fewer topics it can create. I would suggest reading through the best practices for more on this, or using a different clustering model like k-Means, which lets you select the number of topics manually.
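As a rough sketch of both suggestions (not a tuned recipe): the helper names below (`max_possible_topics`, `build_hdbscan_topic_model`, `build_kmeans_topic_model`) are hypothetical and only for illustration, and the values 5 and 30 are arbitrary starting points. The first function shows why a large min_topic_size caps the topic count; the other two assume BERTopic's `min_topic_size` and `hdbscan_model` parameters and scikit-learn's `KMeans`.

```python
def max_possible_topics(n_docs, min_topic_size):
    """Upper bound on the topic count: every cluster needs at least
    min_topic_size documents, so there can be at most n_docs // min_topic_size."""
    if min_topic_size < 1:
        raise ValueError("min_topic_size must be at least 1")
    return n_docs // min_topic_size


def build_hdbscan_topic_model(min_topic_size=5, **bertopic_kwargs):
    """Option 1: keep the default HDBSCAN clustering but allow smaller topics.
    The default of 5 here is an assumption, not a recommended value."""
    from bertopic import BERTopic  # imported lazily so the helper above has no dependencies

    return BERTopic(min_topic_size=min_topic_size, **bertopic_kwargs)


def build_kmeans_topic_model(n_topics=30, **bertopic_kwargs):
    """Option 2: swap the clustering step for k-means, which always
    returns exactly n_topics clusters (and therefore no outlier topic)."""
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    cluster_model = KMeans(n_clusters=n_topics, n_init=10, random_state=42)
    # BERTopic takes the replacement clustering model through hdbscan_model
    return BERTopic(hdbscan_model=cluster_model, **bertopic_kwargs)


if __name__ == "__main__":
    # With 1,000 documents, halving min_topic_size doubles the ceiling on topics
    print(max_possible_topics(1000, 10))
    print(max_possible_topics(1000, 5))
```

Either builder can be given the objects from the original post, for example `build_kmeans_topic_model(n_topics=30, embedding_model=embedding_model, umap_model=umap_model, vectorizer_model=vectorizer_model)`, followed by `fit_transform(docs)` as before. Note that k-means assigns every document to a topic, so documents that HDBSCAN would have marked as outliers end up inside regular topics.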

rap8 commented 4 months ago

I recommend using k-means as the clustering algorithm. In my experience with Chinese topic modeling, k-means works much better.

fishfree commented 4 months ago

@rap8 Thank you for sharing. Would you please post your code snippet?

MaartenGr / BERTopic

Which hyperparameter most influences the number of topics for Chinese texts? #1998