top documents with probability of 1.0 for each topic

yanfan0531 commented 2 years ago

Hi Maarten,

I'm training BERTopic on docs and trying to extract, for each topic, the 30 documents with the highest scores (in descending order of doc_probs), as follows:

doc_topics, doc_probs = topic_model.fit_transform(docs)

However, I found that for each topic, the probabilities of the top 30 documents are mostly 1.0. I was expecting values like 0.98, or at least something below 1.0. Is this normal?

The dataset has 300,000 documents. I tried the following parameters: min_df = 50, ngram_range = (1, 1), min_topic_size = 200, n_neighbors = 100. With these, the model automatically generated 128 topics.

MaartenGr commented 2 years ago

Hi @yanfan0531, apologies for the late reply!

The magnitude of the probabilities depends on the clustering algorithm used, namely HDBSCAN. Some documents sit in very dense structures, so the algorithm is quite certain about their assignment and gives them a high probability. Also, note that HDBSCAN does not make its selection purely based on the probabilities; those are calculated after the clustering. It might also be worthwhile to read through HDBSCAN's documentation.
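To check how common the 1.0 values are, something like the following minimal sketch might help. It assumes doc_topics and doc_probs are the two sequences returned by fit_transform, with doc_probs holding one probability per document (the default when per-topic probabilities are not requested) and -1 marking outliers. The helper names top_documents_per_topic and share_at_one are hypothetical and not part of BERTopic, and the toy values are made up for illustration.

```python
from collections import defaultdict


def top_documents_per_topic(doc_topics, doc_probs, n=30):
    """Return {topic: [(doc_index, prob), ...]}, highest probability first.

    Hypothetical helper, not part of BERTopic. Assumes one probability
    per document and that -1 labels outlier documents.
    """
    by_topic = defaultdict(list)
    for idx, (topic, prob) in enumerate(zip(doc_topics, doc_probs)):
        if topic == -1:  # skip outliers
            continue
        by_topic[topic].append((idx, float(prob)))
    # sorted() is stable, so ties keep their original document order
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for topic, pairs in by_topic.items()
    }


def share_at_one(doc_topics, doc_probs):
    """Return {topic: fraction of its documents with probability >= 1.0}."""
    totals = defaultdict(int)
    at_one = defaultdict(int)
    for topic, prob in zip(doc_topics, doc_probs):
        if topic == -1:
            continue
        totals[topic] += 1
        if float(prob) >= 1.0:
            at_one[topic] += 1
    return {topic: at_one[topic] / totals[topic] for topic in totals}


# Toy stand-ins for the output of topic_model.fit_transform(docs)
doc_topics = [0, 0, 1, -1, 1, 0]
doc_probs = [1.0, 0.72, 1.0, 0.0, 0.91, 1.0]

top = top_documents_per_topic(doc_topics, doc_probs, n=2)
share = share_at_one(doc_topics, doc_probs)
```

If a topic has more than n documents at probability 1.0, the top-n list will consist entirely of 1.0 values, which matches what is described above. In that case the ranking among those documents is arbitrary, since they are tied.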

yanfan0531 commented 2 years ago

Thanks Maarten!

MaartenGr / BERTopic

top documents with probability of 1.0 for each topic #700