Open Connorwz opened 2 months ago
Have you tried looking at some of the hyperparameters of `approximate_distribution`? Since there are similarity metrics/values involved, it might help to look at whether you can reduce the minimum similarity necessary. You can find more about some of them here.
Thanks for your reply! However, may I ask why the minimum similarity affects my problem, namely that some documents have no probability for any of the clusters/topics created by the model?
Sure! You need a minimum similarity to decide which subset of topics is most related to your document. It allows you to filter the most related topics. By lowering the minimum similarity, you will get more topics related to the document (although their similarity values will not change).
Thanks for your explanations! So there is a mechanism within `approximate_distribution()` such that, if the similarities between a document and all topics are below the minimum similarity, it shows zero probability for all of them. Besides, are the probabilities calculated from the weighted similarities of those topics whose similarities with the document are above the minimum similarity?
So there is a mechanism within approximate_distribution() that if similarities between this document and all topics are below the minimum similarity, it shows zero probability to all of them.
That's correct!
Besides, probabilities are calculated by the weighted similarities for those topics whose similarities with the document are above the minimum similarity?
Yes! In practice, it calculates all the similarities and then just ignores those that do not exceed the threshold but the result is the same.
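To make the mechanism discussed above concrete, here is a minimal NumPy sketch. It is not BERTopic's actual implementation: the helper name `threshold_and_normalize`, the toy similarity values, and the choice to turn the surviving similarities into proportions by dividing by their row sum are all assumptions made for illustration.

```python
import numpy as np


def threshold_and_normalize(similarities, min_similarity=0.1):
    """Hypothetical helper: zero out similarities below the threshold,
    then rescale each row so the surviving values sum to 1."""
    sims = np.asarray(similarities, dtype=float).copy()
    # Compute everything first, then ignore values under the threshold.
    sims[sims < min_similarity] = 0.0
    row_sums = sims.sum(axis=1, keepdims=True)
    # Rows where nothing survived stay all-zero instead of dividing by zero.
    return np.divide(sims, row_sums, out=np.zeros_like(sims), where=row_sums > 0)


# Toy document-topic similarities (made-up numbers, 2 documents x 3 topics).
similarities = np.array([
    [0.30, 0.15, 0.02],  # two topics clear the threshold
    [0.05, 0.04, 0.03],  # no topic clears the threshold
])

distribution = threshold_and_normalize(similarities, min_similarity=0.1)
lowered = threshold_and_normalize(similarities, min_similarity=0.01)

print(distribution)  # second row is all zeros
print(lowered)       # second row now has non-zero entries
```

With the higher threshold, the second document ends up with an all-zero row, which matches the behaviour reported in this issue. Lowering the threshold lets more topics through without changing the underlying similarity values.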
Have you searched existing issues? 🔎
Describe the bug
Dear creators of BERTopic, thanks for your work; this package is amazing and I have been using it for a long time. However, I found that some documents (regardless of whether they were used to train the model) have zero topic distributions for all topics created by BERTopic after applying the `approximate_distribution()` function to them. This means that the topic distribution matrix produced by `approximate_distribution()` has some rows that sum to 0. The code below does several things: (1) builds a simple setup for a BERTopic model with PCA and KMeans (from `cuML`) as the dimensionality reduction and clustering techniques; (2) defines a splitting function to split the documents and pre-calculated embeddings; (3) fits the model on the training data and computes topic distributions for both the training and testing data sets. If more information is needed, please let me know. Thanks!
Reproduction
BERTopic Version
0.16.2