MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

reduce_topics assigns many documents to -1 #556

Closed drob-xx closed 2 years ago

drob-xx commented 2 years ago

From what I can see, from both experience and the code, reduce_topics() reassigns to -1 frequently. Is this the expected behavior? If I'm understanding the overall picture, topic clusters are selected based on the HDBSCAN results and documents are assigned to -1 based on a low likelihood of belonging to an identified cluster. Then these clusters are aggregated and a c_tf_idf score is calculated for the entire topic. When doing the reduction, the topic being reduced is compared with all of the other topics by cosine similarity and then assigned to the most similar topic. It seems counter-intuitive that a particular document sorted into a valid cluster by HDBSCAN could then be discounted per the similarity score during the reduction. It feels like there is a mismatch between doing the initial cluster assignment in a way that captures non-symmetric groupings but then using a Euclidean calculation to determine similarity and therefore topic assignment. While not perfect, wouldn't it be reasonable to omit -1 as a potential assignment?

MaartenGr commented 2 years ago

If I'm understanding the overall picture, topic clusters are selected based on the HDBSCAN results and documents are assigned to -1 based on a low likelihood of belonging to an identified cluster. Then these clusters are aggregated and a c_tf_idf score is calculated for the entire topic.

Yes, this is correct! Do note that although we are using HDBSCAN to cluster similar documents together, they are not yet topics until after the c-TF-IDF scoring.

When doing the reduction, the topic being reduced is compared with all of the other topics by cosine similarity and then assigned to the most similar topic.

There are two ways that topic reduction is performed within BERTopic. The first is manual topic reduction which indeed uses cosine similarity to find topics that are closely related to one another. It starts from the least frequent topic and tries to merge it with the most similar topic. That way, we limit the number of micro-topics that are created. You can do this by running topic_model.reduce_topics(docs, topics, nr_topics=30).

The second way is automatic topic reduction which uses HDBSCAN on the entire c-TF-IDF matrix (or c-TF-IDF weighted topic embeddings) in order to merge similar topics together. You can do this by running topic_model.reduce_topics(docs, topics, nr_topics="auto").
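
For reference, a minimal sketch of both options, using the reduce_topics signature referenced above (docs is assumed to be your list of documents; the exact signature may differ between BERTopic versions):

from bertopic import BERTopic

# Fit the model as usual
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# In practice you would run one of the two reductions, not both:
# Manual reduction: iteratively merge the least frequent topics into their most similar topics until 30 remain
topic_model.reduce_topics(docs, topics, nr_topics=30)

# Automatic reduction: cluster the c-TF-IDF representations and merge whatever ends up together
topic_model.reduce_topics(docs, topics, nr_topics="auto")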

It seems counter-intuitive that a particular document sorted into a valid cluster by HDBSCAN could then be discounted per the similarity score during the reduction. It feels like there is a mismatch between doing the initial cluster assignment in a way that captures non-symmetric groupings but then using a Euclidean calculation to determine similarity and therefore topic assignment. While not perfect, wouldn't it be reasonable to omit -1 as a potential assignment?

There are definitely pros and cons to both omitting and keeping -1 as a potential assignment. The reason for keeping -1 as a potential assignment is that if you have very diverse topics and you want to reduce them, it might not make sense to merge them. Let's say that you are iteratively merging topics until you have 10 topics left. At some point, you will most likely encounter a topic, let's say topic a, that is not at all similar to any of the other topics. Here, you will have a choice. Do you merge topic a with another, very different, topic or do you assign topic a to -1? If you choose the former, then there is a good chance that the resulting topic contains quite a bit of noise and will be difficult to interpret since it contains two very different topics. The latter, however, will make sure that all topics are relatively "clean" but will indeed remove topic a as it will be assigned to -1.

In order to keep the topics as interpretable as possible, I opted for the latter. It focuses, in a way, on precision, thereby making sure that whatever topic we generate is as accurate/coherent as possible instead of trying to extract as many topics as we want, which could result in less accurate/coherent topics.

Also, note that the topic reduction techniques are based on the topic representations and not so much on the documents themselves, as we assume that the aggregation of documents does not represent a topic until after we have calculated the c-TF-IDF representations.

drob-xx commented 2 years ago

@MaartenGr - Thanks for the as usual very helpful explanation...

In order to keep the topics as interpretable as possible, I opted for the latter. It focuses, in a way, on precision, thereby making sure that whatever topic we generate is as accurate/coherent as possible instead of trying to extract as many topics as we want, which could result in less accurate/coherent topics.

Yes. This is what I assumed, and it makes sense if that is the tradeoff. I have been looking into this issue because, on the face of it, having a large number of -1s looks sub-optimal. I tried a quick hack - taking all the documents in topics above the cutoff and reassigning them with a document-wise cosine_similarity - with something like this:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

docs = documents['Document'].to_list()
tops = documents['Topic'].to_list()
num_topics = 40
c_tf_idf = CleanAllModel.c_tf_idf

# Keep only the text of documents in topics above the cutoff; blank out the rest
newdocs = [doc if top > num_topics + 1 else '' for doc, top in zip(docs, tops)]
transformedDocs = CleanAllModel.vectorizer_model.transform(newdocs)
sims = cosine_similarity(transformedDocs, c_tf_idf)

# Re-assign those documents to the most similar of the first num_topics topics,
# skipping the first column of the similarity matrix (topic -1)
newTopics = [np.argmax(row) if top > num_topics + 1 else None for row, top in zip(sims[:, 1:num_topics + 1], tops)]
newProbabilities = [row[np.argmax(row)] if top > num_topics + 1 else None for row, top in zip(sims[:, 1:num_topics + 1], tops)]

The results weren't great. The numbers were obviously much better because I forced everything into a topic, but of course when I plotted the new distribution it was messy and not compelling. I played for a bit with setting a probability threshold which improved things a bit - but this didn't seem like a viable approach.

I wound up going back to one of your often-explained solutions - setting the HDBSCAN params min_cluster_size and min_samples to reduce the number of topics - with much better results. It was not only better than my hack, but also noticeably different from doing the reduction using nr_topics. In my case I determined that 51 was a good number of topics.

To compare, I set nr_topics to this value: of 86K documents, 64% were -1. When I set min_cluster_size and min_samples to 570 and 255, the -1s fell to 55.6K, or 41%. Also, the overall distribution was much more coherent when viewing the results in t-SNE visualizations of the UMAP reduction.

Of course, calculating which combinations of min_cluster_size and min_samples get you to a given topic count is more complicated and time-consuming than simply choosing a value. In my case I brute-forced the combinations until I got what I wanted - something like the sweep sketched below. I'm pretty sure a general solution for these calculations is straightforward, and the runs were pretty fast on Colab (no GPU). Would you be interested in exploring a general solution for this use case?
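
For reference, the kind of sweep I mean looks roughly like this - the candidate values are made up for illustration, and umap_embeddings stands in for the reduced embeddings the clusterer is fit on:

import numpy as np
import hdbscan

# Candidate combinations to try; adjust to your corpus size
grid = [(250, 100), (400, 150), (570, 255), (800, 300)]

for min_cluster_size, min_samples in grid:
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                min_samples=min_samples,
                                metric='euclidean')
    labels = clusterer.fit_predict(umap_embeddings)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = int(np.sum(labels == -1))
    print(min_cluster_size, min_samples, n_clusters, n_outliers)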

MaartenGr commented 2 years ago

I have been looking into this issue because, on the face of it, having a large number of -1s looks sub-optimal.

On the one hand, you want to minimize the number of -1s, as you typically prefer a document to be assigned to a topic; on the other hand, assigning too many documents to a cluster might introduce some noise, which could hurt the topic representation.

Keeping the above in mind, there is one solution that might work best in this use case. You train the model as you would always do, without optimizing the number of -1 documents, but instead, you make sure to set calculate_probabilities=True. Setting this value to True does not change the topics that are being created and still results in many -1 documents. However, the topics that are created are rather clean and should contain little noise.

Since we set calculate_probabilities=True, it now returns probs that contain the topic-document probability matrix. Using that, we can now force some of the -1 documents to non-outlier topics by setting a probability threshold, similar to what you have done with the c-TF-IDF-based similarity matrix. You can find a bit more about that here. The code for doing so can be found here:

import numpy as np
probability_threshold = 0.01
new_topics = []
for topic, prob in zip(topics, probs):
    # Only re-assign outlier documents; all other assignments stay as they are
    if topic == -1:
        if max(prob) >= probability_threshold:
            # Assign the outlier to the topic with the highest probability
            new_topics.append(np.argmax(prob))
        else:
            new_topics.append(topic)
    else:
        new_topics.append(topic)
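
If you go this route and also want the topic representations to reflect the re-assigned documents, something along these lines should work, although the exact update_topics signature may depend on your BERTopic version:

# Optionally re-calculate the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)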

A second method for approaching this is by using a different clustering algorithm that does not support outlier selection, like k-Means. You can find a bit more about that here.

Of course, calculating which combinations of min_cluster_size and min_samples get you to a given topic count is more complicated and time-consuming than simply choosing a value. In my case I brute-forced the combinations until I got what I wanted. I'm pretty sure a general solution for these calculations is straightforward, and the runs were pretty fast on Colab (no GPU). Would you be interested in exploring a general solution for this use case?

It would indeed be possible to use an optimization algorithm, with the number of -1 documents as an objective measure, to find the best combination of min_cluster_size and min_samples but it is something worth exploring outside of BERTopic as the solutions above might be a bit more straightforward.

drob-xx commented 2 years ago

Thanks for the detailed explanations, which support things you have said consistently since I've been here. Right now I'm focused on using algorithms like HDBSCAN and PaCMAP, which seem better suited for the task at hand. I understand that k-Means, as a partitioning approach, would eliminate the outlier issue.

I was finally able to get calculate_probabilities=True to work and, after twenty hours of Colab+ processing, have the output. This is compared to 20 minutes with calculate_probabilities=False. I understand the approach and it may be the best solution, but the processing overhead is significant (I'm pursuing cuML, but that will take some time as it won't run on Colab).

I think that, overall, there are two issues that come to mind. The first is about assumptions about the relationship between the topic model and the corpus. When I began work on this I assumed that there was a two-way linkage between the corpus (as clustered) and the resulting topic model. By this I mean that the relationship is close enough that you can easily go back and forth. The more I work with topic modeling (writ large), the more I'm beginning to see that in some approaches it is more of a one-way relationship where the objective is to get a topic model, not to segment the corpus.

The second issue is the question of what we mean when we talk about noise. When I visualize the BERT embeddings it is striking how well everything is related. In a t-SNE visualization of the underlying BERT embeddings you can compare documents to their neighbors and see all sorts of useful connections. If we think of noise as meaningless information, it seems like there is very little of that, much less than is represented in the -1 categorization.

If my intuitions above are correct, then I would say that the decisions embedded in reduce_topics lean towards what I'm calling a one-way linkage, where the emphasis is on creating the best topic representation possible rather than on being able to traverse back to the corpus. In this vein I can see that noise might be documents that enter into a topic cluster and throw off the topic model.

From what I'm seeing, at least with my corpus, there really are very few outliers that could be classified as meaningless information. What seems to be happening is that the clustering algorithms have trouble identifying clearly defined clusters and tend to exclude some texts as "outliers" when most humans wouldn't see them that way.

At this point I am doing more experimentation; my take-away so far is that the HDBSCAN min_cluster_size is quite important, as tuning it can significantly reduce the number of -1 classifications while substantially improving the clustering. I hope this isn't a detail I just missed along the way.

Thanks as always for all your time and attention. BERTopic is a great package overall and really moves the ball forward in topic modeling. In particular, c-TF-IDF seems like a significant contribution to the community at large.

MaartenGr commented 2 years ago

@drob-xx Indeed, although using calculate_probabilities=True is the easiest and likely the best-fitting approach, the computation time to run it is definitely a bottleneck!

The first is about assumptions about the relationship between the topic model and the corpus. When I began work on this I assumed that there was a two-way linkage between the corpus (as clustered) and the resulting topic model. By this I mean that the relationship is close enough that you can easily go back and forth. The more I work with topic modeling (writ large), the more I'm beginning to see that in some approaches it is more of a one-way relationship where the objective is to get a topic model, not to segment the corpus.

Yes, BERTopic started out as a one-way street (embeddings -> UMAP -> HDBSCAN -> c-TF-IDF -> MMR) with some optimization here and there to focus on that pipeline. Those individual steps have become more and more independent from one another (from an API perspective), but that underlying assumption can still be seen. It would indeed be interesting if you could easily go back and forth, but I am not entirely sure how to approach that (yet).

The second issue is the question of what we mean when we talk about noise. When I visualize the BERT embeddings it is striking how well everything is related. In a t-SNE visualization of the underlying BERT embeddings you can compare documents to their neighbors and see all sorts of useful connections. If we think of noise as meaningless information, it seems like there is very little of that, much less than is represented in the -1 categorization.

There is a balance to be found between the number of outliers and non-outliers and the result will, unfortunately, almost always be unbalanced. With HDBSCAN you typically get too many outliers, whilst with k-Means too few are generated. I think in this scenario, noise could be defined as information that does not primarily contribute to the topic representation and generation. As you mention, that would mean that we can often find documents that do not truly fit into one or any cluster but the actual number would indeed depend on the assumption that you have about the data.

Do note that t-SNE is not meant for clustering but gives a general overview and might overestimate the differences/overlap found between points.

At this point I am doing more experimentation; my take-away so far is that the HDBSCAN min_cluster_size is quite important, as tuning it can significantly reduce the number of -1 classifications while substantially improving the clustering. I hope this isn't a detail I just missed along the way.

Although there is a risk of creating micro-clusters by significantly lowering min_cluster_size, using nr_topics="auto" would help remediate that a little bit, since the -1 cluster is ignored in that reduction and HDBSCAN does not forcefully merge topics.

Also, a quick note. Thanks for the extensive write-up 😄 These discussions definitely help in understanding how users view the model but also how to improve upon it!

drob-xx commented 2 years ago

Do note that t-SNE is not meant for clustering but gives a general overview and might overestimate the differences/overlap found between points.

Yes. I mentioned t-SNE because using it to plot 2D points gives a good starting place for understanding the distribution/organization of documents as represented in the embeddings. Visualizations of the UMAP reduction are also helpful, but the embeddings plotted with t-SNE, while distorted to create a symmetrical representation, are illuminating. This is what led me to re-think what was meant by noise, because in my corpus it is arguable that there is very little noise in the way it's often discussed.

I'm relieved to hear that you find these long-winded posts helpful; I was very hesitant to post this kind of content in issues, but you've made it clear that is your preference. Personally, I would love to hear about others' detailed experiences as I'm sure I would learn a lot.

MaartenGr commented 2 years ago

Yes. I mentioned t-SNE because using it to plot 2D points gives a good starting place for understanding the distribution/organization of documents as represented in the embeddings. Visualizations of the UMAP reduction are also helpful, but the embeddings plotted with t-SNE, while distorted to create a symmetrical representation, are illuminating. This is what led me to re-think what was meant by noise, because in my corpus it is arguable that there is very little noise in the way it's often discussed.

Ah, right, that makes sense! The definition of noise here is definitely tricky due to the nature of fitting vs. assigning. During fitting, it is okay to remove a lot of noisy documents for the representation, but that does not mean that those documents should have no topic assigned.

I'm relieved to hear that you find these long-winded posts helpful; I was very hesitant to post this kind of content in issues, but you've made it clear that is your preference. Personally, I would love to hear about others' detailed experiences as I'm sure I would learn a lot.

No problem! Any information, whether it is suggestions, issues, or discussions, is most welcome on the issues page.

Sghosh023 commented 2 years ago

Can anyone tell me whether there is any way (or any built-in method of BERTopic) by which I can avoid assigning documents to Topic -1? Basically, my requirement is that if I pass nr_topics=10, then only 10 topics would be created and no Topic -1 would get generated. Any suggestions regarding this issue are welcome.

MaartenGr commented 2 years ago

@Sghosh023 Yes, you can! The -1 topic is generated through the use of HDBSCAN, which identifies outliers. You could use k-Means instead to generate those 10 topics without the -1 topic. To do so, you would only need to run the following:

from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=10)
topic_model = BERTopic(hdbscan_model=cluster_model)

You can find more information about that here.

Sghosh023 commented 2 years ago

Thanks for the solution @MaartenGr. I have two more doubts; it would be nice if you could clear them up:

  1. If we use the above way of KMeans clustering, then the number of clusters would be pre-defined, right? (like the way you're passing n_clusters=10). When we use HDBSCAN, the number of clusters is determined dynamically; can we have that same feature while using KMeans clustering?
  2. Also, I saw that you're using the Sentence Transformer all-MiniLM-L6-v2 model (in the _bertopic.py file); that model takes into account only a token size of 128, so what would happen when we have bigger documents? Will this model work with bigger documents?

MaartenGr commented 2 years ago

There are several ways of approaching it. You can find most of them, using HDBSCAN, in the documentation here. Essentially, you would be using calculate_probabilities=True to create a document-topic probability matrix from which you can assign outliers (-1) to regular topics. This would mean that the model still creates the topic representations based on the documents in the non-outliers topics, but allows you to re-assign the outliers to regular topics.

A second option is using some new features that are upcoming and which you can install from the PR here. When you use k-Means with a large k value, like 200, you can iteratively merge those that you feel belong together. In the linked PR, there are a number of options for hierarchical topic modeling that help you merge topics that are related to one another.

MaartenGr commented 2 years ago

Now that I think about it, you could even use k-Means with a large k and set nr_topics="auto" in BERTopic to reduce them automatically.
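
Roughly like this (the value of k is just an example; docs is your list of documents):

from bertopic import BERTopic
from sklearn.cluster import KMeans

# Over-cluster with k-Means, then let BERTopic merge similar topics automatically
cluster_model = KMeans(n_clusters=200)
topic_model = BERTopic(hdbscan_model=cluster_model, nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)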

Sghosh023 commented 2 years ago

Thanks @MaartenGr, I guess you have provided a solution for my first query. I will try that out and let you know if I face any challenges. I would also appreciate it if you could take up my second query (from the last comment) and let me know what a possible solution for that could be.

MaartenGr commented 2 years ago

@Sghosh023 Apologies! No, this model will work best with either sentences or paragraphs but not longer documents. You can find a bunch of other models that work quite well here, some of which have larger token sizes. Other than that, it might be worthwhile to go through one of the embedding models here. There are a number of models that can be used for that purpose. For example, you could use Flair to average word embeddings or even use TF-IDF representations for the entire document.
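
As a rough sketch of swapping in a different sentence-transformers model (the model name here is only an illustration; check the maximum sequence length of whichever model you pick):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Any sentence-transformers model can be passed in; pick one whose maximum
# sequence length better fits your documents
embedding_model = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)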

Sghosh023 commented 2 years ago

Thanks @MaartenGr for giving different possible solutions. Just need one more confirmation: if I use a TF-IDF representation for the entire long document, how can I pass that to the BERTopic model and make sure that the embedding creation process doesn't take place using any sentence transformer model? Also, will the next two steps of clustering & topic representation remain intact if I pass a TF-IDF representation instead of an embedding from a transformer?

MaartenGr commented 2 years ago

@Sghosh023

How can I pass that to the BERTopic model and make sure that the embedding creation process doesn't take place using any sentence transformer model?

You can do that by following the documentation here. In essence, you precalculate whatever embeddings you have, whether that is TF-IDF or something else, and simply pass that to your topic model like so:

topics, probs = topic_model.fit_transform(docs, embeddings)
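
For the TF-IDF case specifically, a slightly fuller sketch might look like this (the vectorizer settings are just an example, and this assumes your BERTopic version accepts the pre-computed matrix as in the TF-IDF example in the documentation):

from bertopic import BERTopic
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-compute TF-IDF vectors; the sentence-transformer step is then skipped entirely
vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)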

Also, will the next two steps of clustering & topic representation remain intact if I pass a TF-IDF representation instead of an embedding from a transformer?

Yes, they remain completely intact. You are only changing the embeddings with the step above.

Sghosh023 commented 2 years ago

Thanks @MaartenGr for the solutions, the TF-IDF embeddings are also working fine. But one thing I have seen: even when I pass the same data to the BERTopic model, the output (topics) differs from time to time. Is there any parameter by which I can fix it, so that when I pass the same data to the model the output remains constant every time?

MaartenGr commented 2 years ago

@Sghosh023 This is related to UMAP, which is a stochastic model that generates a different result each time. To control for that, I would advise reading through this FAQ for more information on that and how to prevent it.
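
In short, the usual remedy is to pass a UMAP model with a fixed random_state (at the cost of some parallelism), for example:

from bertopic import BERTopic
from umap import UMAP

# Fixing the random_state makes UMAP, and therefore the resulting topics, reproducible
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model)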

MaartenGr commented 2 years ago

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

hubgitadi commented 2 years ago

Hi Maarten, even if we modify the topics based on probabilities as you described, topic_model.transform(XXX) would still give us topic -1 when predicting the topic of a new text/document, right? Can this modification be included in the topic_model object, so that when we predict on a new text the model will give us the new topic instead of -1?

Thoughts?

--hubgitadi

MaartenGr commented 2 years ago

@hubgitadi You could use .update_topics to update the topics and map them together. Do note, though, that if two outliers get mapped to different topics, that mapping cannot be saved in BERTopic, since it is not clear to which topic an outlier should be mapped by default. Instead, it might be worthwhile to read through the FAQ, as there are quite a number of tips for reducing outliers.