MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic frequency count of a topic does not match the count of topic_model.get_representative_docs for the same topic? #886

Closed: Alla-Abdella closed 1 year ago

Alla-Abdella commented 1 year ago

Hello Maarten, thanks for your great work on topic modeling. I just realized that the count of documents from topic_model.get_representative_docs(0) does not match the frequency of topic 0 in topic_model.get_topic_info().

For example:

topic_model.get_representative_docs(0)  # gives 72 documents

df = topic_model.get_topic_info()
df[df.Topic == 0]  # gives 1000 documents

I expected the two counts to be equal, but I'm not sure whether this is an error in my code or something I've misunderstood about the two functions. Thanks

MaartenGr commented 1 year ago

The function .get_representative_docs does not return all documents in a topic but merely the most representative ones. Those are the documents that best describe a certain topic, selected using either c-TF-IDF similarity or HDBSCAN's exemplars. To prevent BERTopic from becoming a document database, not all documents are saved within the topic model; only the most representative ones are. If you want to extract the documents together with their topics, you can do something like the following:

import pandas as pd

# pair every document with the topic it was assigned to
df = pd.DataFrame({"Doc": docs, "Topic": topics})
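You can then inspect all documents of a single topic with a simple filter, e.g. df[df.Topic == 0].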

In a few weeks' time, I will release version v0.13, which will allow you to extract all documents together with some metadata, like their topics, probabilities, and representative documents, using the following function:

doc_info = topic_model.get_document_info(docs)
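For example, you could then select all documents assigned to topic 0, together with their metadata, like this (a small sketch; doc_info is a DataFrame with one row per document):

# keep only the rows for documents assigned to topic 0
topic_0_info = doc_info[doc_info.Topic == 0]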
Alla-Abdella commented 1 year ago

Thanks, Maarten, for your response. I appreciate it. I'm so excited about v0.13. I have a follow-up question: how does BERTopic decide that those 72 docs are the most representative of the 1000 docs in Topic 0? Also, another question about the same topic: I noticed there are many false positives within Topic 0, where some of the docs do not belong to the topic at all, yet were tagged with Topic 0. Is there a way of minimizing these false positives, or at least a metric to identify them?

MaartenGr commented 1 year ago

There are actually only 3 documents that are most representative of a given topic. However, when you merge or reduce topics, the representative documents also get merged. In other words, the 72 representative docs are the accumulated representative docs of the multiple topics that were merged into topic 0.

It depends on what you mean by false positive. Your topic 0 is most likely a combination of multiple smaller topics, which might explain the assignments.

Alla-Abdella commented 1 year ago

In this scenario, there are comments (i.e., documents) that have been labeled with Topic 0 but actually only match one word of Topic 0's name and have a different context. For example, the name of Topic 0 is "0_issues_letters_closed_case". Upon examination, many of these comments are about something unrelated to the topic name. I call these comments false positives, as they were manually identified by me. I am interested in finding a way to automatically identify or quantify these false positive comments as a percentage. Can you provide any ideas on how to do this? I would like to hear your thoughts. Thanks

MaartenGr commented 1 year ago

Automatically identifying them is rather difficult, since you manually decide what should and should not be related to a certain topic. There is some subjectivity involved, so it is hard to create an objective evaluation metric from that. Having said that, it might be worthwhile to understand how these documents came to be assigned and focus on that.

For example, if a topic has 72 representative documents and each original topic contributes 3, then you have merged 24 topics together. That is quite a number of topics! Most likely, you used something such as nr_topics=N to make sure you get the number of topics you are looking for. However, when you merge that many topics together, there is a risk of assigning documents to a topic that is not the best fit for them. Instead, it might be worthwhile to let HDBSCAN create fewer topics before merging them. You can do this by increasing the min_cluster_size value to prevent micro-clusters from being created.
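A minimal sketch of that approach, passing a custom HDBSCAN model to BERTopic (the min_cluster_size of 150 is an arbitrary value you would tune to your data):

from bertopic import BERTopic
from hdbscan import HDBSCAN

# larger minimum clusters from the start means fewer micro-topics to merge later
hdbscan_model = HDBSCAN(min_cluster_size=150, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)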

Another option is to use the probs returned by .fit_transform to find all documents that are assigned to topic 0 but have a low probability. Perhaps those match your intuition of what should and should not be assigned to that topic.
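A rough sketch of that idea, assuming probs is the 1-D array of assignment probabilities returned by .fit_transform (the 0.5 threshold is an arbitrary cut-off to experiment with):

import numpy as np

# topics, probs as returned by topic_model.fit_transform(docs)
topics_arr = np.array(topics)
probs_arr = np.array(probs)

# topic-0 documents whose assignment probability is low
low_conf = (topics_arr == 0) & (probs_arr < 0.5)
print(f"{low_conf.sum() / (topics_arr == 0).sum():.1%} of topic 0 docs fall below the cut-off")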

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!