Closed: piotrcelinski closed this issue 1 year ago
It might be that the initial number of topics that were created was already small and that there is something going on with the cluster model.
Could you provide your full code? It is difficult to see what you are exactly passing to the model. Also, which version of BERTopic are you using?
Hello, the sample contained 509 texts. BERTopic detected only topic -1, and the number of texts in it was 35 (which looks strange to me). BERTopic version: 0.15.0. Parameters as below:
```python
from bertopic import BERTopic
from bertopic.representation import BaseRepresentation
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

self.topic_model = BERTopic(
    embedding_model=SentenceTransformer('paraphrase-multilingual-mpnet-base-v2'),
    umap_model=UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric='cosine',
        random_state=42
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=14,
        metric='euclidean',
        cluster_selection_method='eom',
        prediction_data=True
    ),
    vectorizer_model=CountVectorizer(stop_words=[***LIST OF STOPWORDS HERE***]),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    representation_model=BaseRepresentation(),
    language='multilingual',
    nr_topics='auto',
    top_n_words=20,
    n_gram_range=(1, 3),
    verbose=True
)
```
I am not sending the full code, as the codebase is large and would be very time-consuming to analyze. Piotr
Then the issue you are getting is that no actual topics were created. HDBSCAN typically does not work that well with small datasets, so setting `min_cluster_size` to a lower value like 3 would likely be necessary. Alternatively, you can use k-Means or another algorithm where you can specify `k` to perform the clustering. You can find more about that here.
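A minimal sketch of the idea (assuming scikit-learn is installed): unlike HDBSCAN, k-Means assigns every point to one of `k` clusters and never produces a -1 outlier label, so every document ends up in a topic. The array sizes and `n_clusters=3` below are illustrative, not taken from the issue.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 5))  # stand-in for 5-dimensional UMAP embeddings

# k is chosen explicitly, unlike HDBSCAN's density-based cluster count
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

assert (labels >= 0).all()  # no -1 outlier label, unlike HDBSCAN
assert len(set(labels)) == 3  # exactly k clusters

# In BERTopic, the same model can be passed in place of HDBSCAN:
# topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=3, random_state=42))
```

BERTopic accepts any clustering model with `fit`/`predict` methods via the `hdbscan_model` parameter, which is how the swap is wired in.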
Thank you very much!
Hi, I set `nr_topics` to `'auto'` in:

and got:

```
IndexError: list index out of range
```

The traceback is below. What am I doing wrong? Piotr