MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Question about the threshold for assigning documents to a specific topic #686

Closed felipelopezp726 closed 2 years ago

felipelopezp726 commented 2 years ago

Hi Maarten, this is more of a question than a problem.

What is the probability threshold for assigning a record (a document) to a specific topic? That is, above what probability of belonging to a specific topic is a document assigned to that topic instead of topic -1 (outliers)?

I ask because I ran several tests and looked at the probabilities of the records that were predicted as topic "-1". Some of them have probabilities of belonging to other topics higher than 0.25 (see rows 5 and 13 in the attached image) or higher than 0.1 (see rows 9, 13, and 19). On the other hand, taking row 18 as an example, the probability of belonging to topic "5" is 0.18, and in this case the record was assigned to topic "5" rather than to topic "-1". So, stated more precisely, my questions are:

1) What is the threshold in the probabilities for assigning a record to a specific topic other than "-1" (because initially I thought it was 0.5)?

2) What is the criterion for assigning a record to the "-1" topic, i.e., at what thresholds or limits is the decision made to assign the records to the "-1" topic?

[Attached image: Duda_Probabilidades]

MaartenGr commented 2 years ago

The probabilities that BERTopic generates are not the values the algorithm uses to decide which topic each document is assigned to. They are calculated after the model has already assigned a topic to each document. In other words, the model tries to recreate these probabilities based on how the topics were created.

This means that the answer to the following question:

What is the threshold in the probabilities for assigning a record to a specific topic other than "-1" (because initially I thought it was 0.5)?

is that there is no probability threshold for assigning a record to a specific topic, since the model does not use the probabilities for that.

Similarly, the answer to:

What is the criterion for assigning a record to the "-1" topic, i.e., at what thresholds or limits is the decision made to assign the records to the "-1" topic?

is that the criterion is a bit more involved than the calculated probabilities and technically does not involve them directly.

All in all, I believe there are two sources you can read through that should make this a bit clearer. First, read through the underlying algorithm of BERTopic to get an idea of how the approach works and how clustering is involved. Second, read through the HDBSCAN documentation, and more specifically this page, which explains how soft clustering in HDBSCAN works and thereby how the probabilities are calculated.
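
If you want to look at this in your own results, here is a minimal sketch in plain Python. It is not what BERTopic does internally. It only inspects the outputs after the fact and, if you choose to, applies a threshold of your own. It assumes `topics` is the list of assigned topics and `probs` is the document-by-topic probability matrix, with column `i` corresponding to topic `i`. The helper names `summarize_outliers` and `reassign_outliers`, the threshold of 0.25, and all the numbers are made up for illustration.

```python
def summarize_outliers(topics, probs):
    """For every document labeled -1, return (doc index, most probable topic, probability)."""
    rows = []
    for doc_id, (topic, row) in enumerate(zip(topics, probs)):
        if topic == -1:
            # Assumption: column i of the probability row belongs to topic i.
            best_topic = max(range(len(row)), key=row.__getitem__)
            rows.append((doc_id, best_topic, row[best_topic]))
    return rows


def reassign_outliers(topics, probs, threshold):
    """Return a copy of `topics` where outliers whose best probability reaches
    `threshold` are moved to that topic. The threshold is a user choice,
    not something the model itself uses."""
    new_topics = list(topics)
    for doc_id, best_topic, prob in summarize_outliers(topics, probs):
        if prob >= threshold:
            new_topics[doc_id] = best_topic
    return new_topics


# Hypothetical values: four documents, six topics (0 to 5).
topics = [-1, 5, -1, 2]
probs = [
    [0.02, 0.05, 0.27, 0.01, 0.03, 0.04],  # outlier, yet 0.27 for topic 2
    [0.01, 0.02, 0.03, 0.02, 0.01, 0.18],  # assigned to topic 5 with only 0.18
    [0.01, 0.02, 0.03, 0.02, 0.01, 0.04],  # outlier with low probabilities everywhere
    [0.03, 0.02, 0.61, 0.02, 0.01, 0.04],
]

print(summarize_outliers(topics, probs))       # [(0, 2, 0.27), (2, 5, 0.04)]
print(reassign_outliers(topics, probs, 0.25))  # [2, 5, -1, 2]
```

The first two rows mirror what you observed: an outlier can carry a higher probability than a document that did get a topic, because the assignment comes from the clustering step and not from these values.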

MaartenGr commented 2 years ago

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!