MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Probability Distribution #1779

Open sucduit opened 6 months ago

sucduit commented 6 months ago

Hi, I used the code to get the document information. For each document, I got a probability value. From my understanding, this value is the probability that the document belongs to a particular topic, namely the topic with the highest probability among all topics. My BERTopic results contain 21 topics. For example, document one belongs to topic -1 with a reported probability of 0.441466441. I then ran the following code to get the probability distribution:

df = tm.approximate_distribution(doc)
df_prob = pd.DataFrame(df[0])

The results for the first document are as follows:

0.030 | 0.069 | 0.084 | 0.052 | 0.052 | 0.047 | 0.028 | 0.019 | 0.084 | 0.052 | 0.022 | 0.019 | 0.086 | 0.068 | 0.030 | 0.112 | 0.059 | 0.000 | 0.043 | 0.030 | 0.014

There are 21 values in total. My question is: are these 21 values the probabilities that document one belongs to each of the 21 topics? With tm.get_document_info(doc), the first document is assigned to topic -1 with a probability of around 0.44. Why is the probability 0.030 in the result from tm.approximate_distribution(doc)? Can you please help me understand this? I use BERTopic in my dissertation and need to discuss the probability distribution of the documents. Thank you very much.

MaartenGr commented 6 months ago

I believe there are a number of issues about exactly this, but it essentially boils down to the following. The probabilities only relate to non-outlier topics. Therefore, 0.030 belongs to topic 0, not to topic -1. The probability of the outlier topic is calculated as 1 - sum(probs).
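
To make the reply concrete, here is a minimal sketch in plain Python that applies it to the row posted in the question. It assumes that column i of the distribution corresponds to topic i (the reply only confirms this for column 0) and that there is no column for topic -1. The variable names are illustrative and not part of BERTopic.

```python
# Row for the first document, copied from the question (rounded to 3 decimals).
row = [
    0.030, 0.069, 0.084, 0.052, 0.052, 0.047, 0.028,
    0.019, 0.084, 0.052, 0.022, 0.019, 0.086, 0.068,
    0.030, 0.112, 0.059, 0.000, 0.043, 0.030, 0.014,
]

# Assumption: column i holds the value for topic i; topic -1 has no column.
topic_probs = dict(enumerate(row))

# Topic with the largest value among the non-outlier topics.
best_topic = max(topic_probs, key=topic_probs.get)

# Remainder described in the reply: 1 - sum(probs).
outlier_remainder = 1.0 - sum(row)

print(f"topic 0: {topic_probs[0]:.3f}")
print(f"largest non-outlier topic: {best_topic} ({topic_probs[best_topic]:.3f})")
print(f"1 - sum(probs): {abs(outlier_remainder):.3f}")
```

For the rounded values posted above, the row already sums to approximately 1, so the remainder comes out near zero. The 0.44 reported by tm.get_document_info(doc) therefore does not appear anywhere in this row.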