MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Most documents assigned to -1 topic #423

Closed · ViktoriaSpaiser closed this 2 years ago

ViktoriaSpaiser commented 2 years ago

Dear Maarten, many thanks for this great module; we are currently exploring it in our research project and it's brilliant. One issue we encountered, and where we hoped you could provide some guidance, is that many of the documents (in our case tweets) are "dumped" into topic -1 (or sometimes -1 and 0). For instance, in one of our latest analysis runs, around 120,000 tweets out of over 145,000 were classified as -1 or 0, both topics that were mainly represented by stop words.

We are currently using the flair word embedding + document pool embedding model, as this has been recommended for short text, but we also experimented with other models, like the default one for English, and the issue remains. We also tried removing stop words, as you recommend in one of your posts for decreasing memory pressure, but this did not help much: there is still a large -1 topic that contains most of the documents, even though it is now no longer defined by stop words but by words that are general to the overall theme of the data (in our case the UN Climate Change Conference in a given year, COPxx).

Looking at the probabilities, our understanding is that any document with a low probability (there seems to be some threshold around 0.05?) for all of the extracted meaningful topics is put in the -1 topic, even if its probability for -1 itself is zero or very low. We also experimented with various other hyperparameters to optimise the outcome, but did not find a way to solve this specific issue. Should we use a different embedding model, or is there anything else you think we can do to avoid this issue? We would greatly appreciate your response. Many thanks!

MaartenGr commented 2 years ago

Hi Viktoria,

That is quite a large percentage being seen as outliers! Let's see if we can find a way to reduce the number of outliers.

First, could you share how you have initialized BERTopic and with which parameters? Perhaps there is a combination of parameters that are not working well together resulting in many outliers being generated.

Second, there is some documentation about reducing outliers that you can find here. Have you had the chance to try out the suggestions there for reducing outliers?

Third, I would very much advise sticking to a SentenceTransformer model, especially since you are using short documents (tweets). More specifically, assuming all documents are in English, you can use all-mpnet-base-v2 as the embedding model, since it is quite accurate in representing documents.

Finally, it might be helpful to generate the embeddings beforehand so you can iterate quickly: you only need to compute the embeddings once and can then try out different parameters. The general approach would be something as follows:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Prepare embeddings
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(my_un_docs)

# Train our topic model using our pre-trained sentence-transformers embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(my_un_docs, embeddings)
ViktoriaSpaiser commented 2 years ago

Hi Maarten, many thanks for your swift and thorough response, this is much appreciated.

Here is how we initialized BERTopic in our latest analysis run:

from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

my_stopwords = frozenset(["rt", "RT", "&", "amp", "&amp", "http", "https",
                          "http://", "https://", "fav", "FAV"])
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=my_stopwords, min_df=10)

glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

topic_model_fl150 = BERTopic(embedding_model=document_glove_embeddings, nr_topics='auto',
                             min_topic_size=150, top_n_words=10,
                             vectorizer_model=vectorizer_model).fit(tweetslist1)

However, we also experimented with other hyperparameter values, e.g. various numbers between 10 and 150 for min_topic_size, and a few other embedding models (all-MiniLM-L6-v2, roberta-base, fasttext-wiki-news-subwords-300), but we would usually still get over half of the documents discarded as outliers.

Many thanks for your three suggestions, this is very helpful. We will now try out what you suggested and report back on whether we were able to significantly reduce the number of outliers this way. This will hopefully also be useful to other users.

MaartenGr commented 2 years ago

I can see several things happening here. First, by setting nr_topics="auto" it might happen that certain topics are merged into the -1 class, resulting in a large number of outliers. I would advise not setting nr_topics and instead playing around with min_samples and min_cluster_size as shown below:

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Setting min_samples lower than min_cluster_size makes HDBSCAN less conservative,
# which typically reduces the number of outliers
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

Second, GloVe embeddings typically result in lower-quality clusters, and sentence-transformer models often outperform them. Using "all-mpnet-base-v2" or "all-MiniLM-L6-v2" is definitely suggested.
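As a quick sketch of that swap (a minimal example; you can pass either the model name directly or a loaded SentenceTransformer instance):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Pass the model name directly and let BERTopic load it...
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

# ...or load it yourself, which also lets you pre-compute embeddings
sentence_model = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=sentence_model)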

Third, by setting calculate_probabilities=True you can assign every document to a topic by simply extracting the topic with the maximum probability for each document. You can do this as follows:

import numpy as np
from bertopic import BERTopic

# Train topic model with document-topic probabilities enabled
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(tweetslist1)

# Map each document to a non-outlier topic
new_topics = list(np.argmax(probs, axis=1)[1:])
ViktoriaSpaiser commented 2 years ago

Thank you Maarten for reviewing our implementation and for your further comments, we will implement all your suggestions now, including removing nr_topics specification, which we had originally to reduce the number of topics. We will try and report back.

fsysean commented 2 years ago
# Map each document to a non-outlier topic
new_topics = list(np.argmax(probs, axis=1)[1:])

Should it start from 0, i.e. new_topics = list(np.argmax(probs, axis=1)[0:])?

MaartenGr commented 2 years ago

@fsysean Yes, it should indeed start from 0. I had posted this code in another issue where it did need to start from 1 since it needed to have at least one outlier in there. So you can use 0 here.
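For clarity, the corrected mapping is then simply:

import numpy as np

# Assign every document, outliers included, to its most probable topic
new_topics = list(np.argmax(probs, axis=1))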

NicoleNisbett commented 2 years ago

Dear Maarten,

I'm working with @ViktoriaSpaiser on her research project using your BERTopic package. Many thanks for your help and advice so far; we have retrained the model using your suggestions as below:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

my_stopwords = frozenset(["rt", "RT", "&", "amp", "&amp", "http", "https",
                          "http://", "https://", "fav", "FAV"])
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=my_stopwords, min_df=10)

min_cluster = 200
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True,
                        min_samples=round(min_cluster / 2))

sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(tweetslist1)

COP20_model = BERTopic(hdbscan_model=hdbscan_model, embedding_model=sentence_model,
                       vectorizer_model=vectorizer_model, calculate_probabilities=True)

topics, probs = COP20_model.fit_transform(tweetslist1, embeddings)

However, we are still getting the majority of documents assigned to the -1 topic category.


Also, before, when we assigned values to nr_topics and min_topic_size, we were able to control the number of topics produced, as you recommend larger values for large datasets. Can we still use these parameters, or is there a similar recommendation for min_cluster_size and min_samples instead?

ViktoriaSpaiser commented 2 years ago

Just to add, we also tried your other recommendation of reassigning the outlier documents to other topics using the estimated probabilities, and this works as a workaround, but it is not necessarily optimal if the probabilities are rather low. We will now experiment with various values for min_samples to see whether we can further reduce the number of outliers. We are also considering going back to setting nr_topics to "auto", since removing this specification did not really help reduce the number of outliers and it seems like a useful option for reducing the number of topics extracted. Many thanks!

MaartenGr commented 2 years ago

Hmmm, I had hoped the number of outliers would have been reduced by now. Perhaps you can try lower values of min_samples to have a bit more control over the number of outliers generated. You can go as low as 5 just to see what happens with your data. You can find a bit more about that here.

One thing to note, have you checked which types of documents were assigned as outliers? Does it make sense that these were assigned as outliers?

Are we able to still use these parameters or is there a similar recommendation for the min_cluster_size and min_samples instead?

Fortunately, you can change these values separately to have some control over the outliers that are generated. So increasing min_cluster_size and reducing min_samples should not be an issue.
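As a sketch (the specific values here are just placeholders to tune on your data), decoupling the two could look like this:

from bertopic import BERTopic
from hdbscan import HDBSCAN

# A large min_cluster_size keeps the number of topics manageable on a large dataset,
# while a small min_samples makes HDBSCAN less eager to label documents as outliers
hdbscan_model = HDBSCAN(min_cluster_size=200, min_samples=5, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)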

Just to add, we did also try your other recommendation to reassign the outlier documents to other topics using the estimated probabilities and this works as a work-around, but is not necessarily optimal if the probabilities are rather low.

Out of curiosity, why is this not optimal if the probabilities are low? Some points are likely to exist on the plane that separates several clusters and would then indeed have similar probabilities but a point would almost always be closer to one cluster compared to all others. By using the argmax procedure we are essentially going from a soft-clustering approach to a hard-clustering approach like k-Means.

The great thing about having a probability matrix is that you could also take a step in between and only select the argmax if the highest probability exceeds a certain point. As you mention that low probabilities might be an issue, we can assign outliers to clusters only if they exceed a threshold probability. Implementation would be something like this:

import numpy as np

# Only move a document out of the outlier class if its best topic is probable enough
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

The above would allow you to play around with the probability threshold until you get the number of outliers that you are satisfied with.
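For example, a small sketch (assuming probs comes from fit_transform with calculate_probabilities=True) that sweeps a few candidate thresholds and counts the remaining outliers:

import numpy as np

# See how many documents would remain outliers at each candidate threshold
for probability_threshold in [0.01, 0.05, 0.1, 0.2]:
    new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1
                  for prob in probs]
    print(probability_threshold, sum(topic == -1 for topic in new_topics))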

ViktoriaSpaiser commented 2 years ago

Hello again, just to close the issue, we thought we would report the outcomes of our latest attempts and the conclusions we will draw for our analysis.

So we tried going as low as 5 for min_samples and, surprisingly, that did not really reduce the number of outliers much. We still had over 84,000 documents (out of over 145,000) assigned to the -1 topic, and we got a second, smaller outlier topic 0.

The documents assigned to the outlier topic are not meaningless (we did check that), but at least some tend to be a bit more general.

So for our project we have decided to work with the probability matrix and assign the outlier documents to the topics for which they have the highest probability, as you suggested (and your explanation that some points sit on the plane separating several topics makes complete sense). We will use a threshold as you suggested, but we need to play around a bit to identify the optimal value. In any case, this seems to be the best way to deal with outlier documents and reduce the number of essentially discarded documents.

Many thanks again for your support throughout Maarten, it is much appreciated.

MaartenGr commented 2 years ago

@ViktoriaSpaiser Glad to hear that using the probability matrix is a reasonable solution for your project. Having said that, if you run into any other issues or questions, please let me know!

Also, thank you for addressing this issue as it might help other users understand both the pros and cons of using BERTopic.