MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Calculating Probabilities on Zero-Shot Learning #2076

Open marcomarinodev opened 2 months ago

marcomarinodev commented 2 months ago


Describe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.

Recently I started looking at BERTopic as a method to classify customer tickets into categories defined in the zeroshot_topic_list parameter. After fitting the model by calling fit_transform, my goal was to look up, for each document, the probability of that document belonging to each of the topics (both predefined and generated ones).

probs is None after fit_transform, as expected and as mentioned at the end of https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. Therefore, I then called transform to get the probabilities.

Now I've got two questions I would like to answer:

1. Is this one the right approach to get the probabilities?
2. Most of the documents have almost the same (high) probability among all the topics. Does this mean the clustering didn't fit the data that well? What do you suggest?

Again, thank you in advance for your effort, and I look forward to contributing as well if needed.

Reproduction

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer

topic_model = BERTopic(
  calculate_probabilities = True,
  vectorizer_model = CountVectorizer(stop_words=default_stopwords + custom_stopwords),
  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True),
  embedding_model = sentence_model,
  min_topic_size = 50,
  zeroshot_topic_list = taxonomy_list,
  zeroshot_min_similarity = .80,
  representation_model = KeyBERTInspired(),
  verbose = True,
)

topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)

The output is:

2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203  0.8880229  0.8636132  ... 0.8376566  0.8344382  0.82069004]
 [0.92492086 0.8977871  0.91309047 ... 0.9031642  0.90693504 0.9101905 ]
 [0.9009018  0.9072951  0.91581297 ... 0.9002963  0.8903491  0.8731015 ]
 ...
 [0.85506856 0.9055494  0.8743948  ... 0.8490986  0.8607102  0.8375366 ]
 [0.8783543  0.88571143 0.8983458  ... 0.8795974  0.87251174 0.8653137 ]
 [0.8480984  0.8751351  0.8680049  ... 0.8383051  0.8333502  0.8458183 ]]

BERTopic Version

0.16.2

MaartenGr commented 1 month ago

Apologies for the late reply!

is this one the right approach to get the probabilities?

This is indeed one of the approaches to get probability-like scores.

most of the documents have almost the same (high) probability among all the topics. Does this mean the clustering didn't fit the data that well? What do you suggest?

What you get out of .transform are technically the similarity scores between the document and topic embeddings. Depending on the underlying embedding model, they may indeed all be high, but that shouldn't necessarily be a problem.

What you can do is apply softmax on these similarity scores to make sure they all sum to 1. You often get a much nicer distribution if you do that.
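The softmax trick above can be sketched as follows. This is a minimal illustration with NumPy, using made-up similarity values shaped like the matrix printed earlier (not actual model output):

```python
import numpy as np

# Hypothetical cosine-similarity scores for 2 documents across 4 topics,
# mimicking the matrix returned by .transform above.
sims = np.array([
    [0.846, 0.888, 0.863, 0.837],
    [0.924, 0.897, 0.913, 0.903],
])

# Row-wise softmax: subtract each row's max for numerical stability,
# exponentiate, then normalize so every row sums to 1.
exp = np.exp(sims - sims.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(probs.sum(axis=1))  # each row now sums to 1.0
```

Because softmax is monotonic, the most likely topic per document is unchanged; only the scale of the scores becomes probability-like.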

At some point, I want to add parameters to .transform and .fit_transform that allow choosing which type of probability/similarity score is used.

marcomarinodev commented 1 month ago

Hi @MaartenGr thanks for the reply.

So, as far as I understand, .fit_transform is not supposed to compute the similarity scores between document and topic embeddings in zero-shot mode, right?

Apart from that I could be a candidate for implementing the point you mentioned :)

MaartenGr commented 1 month ago

@marcomarinodev

So, as far as I understand, .fit_transform is not supposed to compute the similarity scores between document and topic embeddings in zero-shot mode, right?

During .fit_transform there are two steps. The first is assigning the zero-shot topics to documents using the similarity scores between document and zero-shot topics. The second is running the default BERTopic pipeline over all documents that could not be assigned to zero-shot topics. Therefore, it is a combination of computing the similarity scores between document and topic embeddings and running HDBSCAN (or some other clustering algorithm).
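The first of these two steps can be sketched roughly as follows. This is a simplified illustration of the idea, not BERTopic's internal code; the helper name is hypothetical and it assumes L2-normalized embeddings so that a dot product equals cosine similarity:

```python
import numpy as np

def assign_zeroshot(doc_embeddings, topic_embeddings, min_similarity):
    """Step 1 (sketch): assign each document to its best-matching
    zero-shot topic if the cosine similarity clears the threshold
    (cf. the zeroshot_min_similarity parameter)."""
    sims = doc_embeddings @ topic_embeddings.T      # (n_docs, n_topics)
    best_topic = sims.argmax(axis=1)                # closest zero-shot topic
    assigned = sims.max(axis=1) >= min_similarity   # mask of matched docs
    return best_topic, assigned

# Documents where `assigned` is False would then go through step 2:
# the regular BERTopic pipeline (HDBSCAN or another clustering algorithm).
```

This also explains why probs is None after a zero-shot fit: the two steps produce hard assignments and cluster output, not one unified document-topic probability matrix.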

Apart from that I could be a candidate for implementing the point you mentioned :)

That would be great! Note that this feature requires checking whether assignment via topic embeddings or via the HDBSCAN method is actually possible before doing so. With this feature, I believe user experience should be the main concern, as the code should be nothing more than selecting either technique.

Essentially, it would be something like this:

# Using cosine similarity between topic and document embedding
topics, probs = topic_model.transform(MY_DOCS, method="cosine_similarity")

# Using the clustering algorithm (although I'm not convinced of the name since
# it can also be a logistic regression or even a knowledge graph)
topics, probs = topic_model.transform(MY_DOCS, method="cluster_algorithm")

marcomarinodev commented 1 month ago

During .fit_transform there are two steps. The first is assigning the zero-shot topics to documents using the similarity scores between document and zero-shot topics. The second is running the default BERTopic pipeline over all documents that could not be assigned to zero-shot topics. Therefore, it is a combination of computing the similarity scores between document and topic embeddings and running HDBSCAN (or some other clustering algorithm).

All clear now regarding the zero shot steps.

That would be great! Note that this feature requires to check whether it is actually possible to either assign using topic embeddings or the HDBSCAN method before actually doing so. With this feature, I believe that the user experience should be the main concern as the code should be nothing more than selecting either technique.

What are your concerns there? Calling transform with a different similarity score would just recompute the document-topic similarities with the new score metric, right? https://github.com/MaartenGr/BERTopic/blob/2353f4c21d74e33e34e30dbae938304bff094792/bertopic/_bertopic.py#L591

MaartenGr commented 1 month ago

It indeed does nothing more than compute the similarity scores. It's more about how you interface with either of these methods. Do you put the hyperparameter in the .transform function, or even in the init for an easier grid search (for those who use scikit-learn)? What would be the name of the hyperparameter? How do you handle errors if you want HDBSCAN-based probabilities but they're not available? Things like that.