Hi @yanfan0531, apologies for the late reply!
How high the probabilities are depends on the clustering algorithm used, namely HDBSCAN. Some documents sit in very dense structures and are therefore assigned to their cluster with near certainty, which gives them a high probability. Also, note that HDBSCAN does not make its cluster selection based purely on the probabilities; those are calculated after the clustering. It might also be worthwhile to read through HDBSCAN's documentation.
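To see how saturated the probabilities are, one could count, per topic, the share of documents whose probability is effectively 1.0. This is only a minimal sketch: it assumes `doc_topics` and `doc_probs` are the parallel per-document sequences returned by `topic_model.fit_transform(docs)`, with each entry of `doc_probs` being the probability of the assigned topic. The helper name `share_at_one`, the `tol` parameter, and the toy data are hypothetical and only there for illustration.

```python
from collections import defaultdict

def share_at_one(doc_topics, doc_probs, tol=1e-9):
    """Return {topic: fraction of its documents with probability ~1.0}.

    Hypothetical helper; assumes one topic label and one probability
    per document, in the same order.
    """
    totals = defaultdict(int)
    at_one = defaultdict(int)
    for topic, prob in zip(doc_topics, doc_probs):
        if topic == -1:  # -1 is the outlier label, so skip it
            continue
        totals[topic] += 1
        if prob >= 1.0 - tol:
            at_one[topic] += 1
    return {t: at_one[t] / totals[t] for t in totals}

# Toy data standing in for real model output (not actual results)
toy_topics = [0, 0, 0, 1, 1, -1]
toy_probs = [1.0, 1.0, 0.5, 1.0, 0.25, 0.0]
shares = share_at_one(toy_topics, toy_probs)
```

A topic with a share close to 1 would suggest that most of its documents lie in a dense core, which is consistent with the explanation above.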
Thanks Maarten!
Hi Maarten,
I'm trying to train BERTopic on docs and extract, for each topic, the top 30 documents with the highest scores (in descending order of doc_probs), as follows:
doc_topics, doc_probs = topic_model.fit_transform(docs)
However, I found that for each topic, the probabilities of the top 30 documents are mostly 1.0. I was expecting probabilities like 0.98 or something else below 1.0. Is this normal?
The number of documents is 300,000. I tried out the following params: min_df = 50, ngram_range = (1, 1), min_topic_size = 200, n_neighbors = 100, and it automatically generated 128 topics.
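For reference, the top-30 extraction described above could be sketched roughly as follows. This is not BERTopic's own API: `top_docs_per_topic` is a hypothetical helper, and it assumes `docs`, `doc_topics`, and `doc_probs` are parallel sequences with one assigned-topic probability per document. The toy inputs are made up for illustration.

```python
from collections import defaultdict

def top_docs_per_topic(docs, doc_topics, doc_probs, k=30):
    """Return {topic: [(prob, doc), ...]} with the k highest-probability
    documents per topic, in descending order of probability.

    Hypothetical helper; assumes the three inputs are aligned by index.
    """
    grouped = defaultdict(list)
    for doc, topic, prob in zip(docs, doc_topics, doc_probs):
        if topic == -1:  # skip outliers, which belong to no topic
            continue
        grouped[topic].append((prob, doc))
    return {
        topic: sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
        for topic, pairs in grouped.items()
    }

# Toy data standing in for real documents and model output
toy_docs = ["a", "b", "c", "d", "e"]
toy_topics = [0, 0, 1, 0, -1]
toy_probs = [0.5, 1.0, 0.75, 0.25, 0.0]
top = top_docs_per_topic(toy_docs, toy_topics, toy_probs, k=2)
```

With real output one would call `top_docs_per_topic(docs, doc_topics, doc_probs)`. If many documents in a topic are tied at 1.0, the order among them simply follows their order in `docs`, since Python's sort is stable.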