MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Probabilities table has topics out of order #1024

Closed zilch42 closed 1 year ago

zilch42 commented 1 year ago

Hi there,

I'm quite confused by the probabilities table produced using calculate_probabilities=True. I think in some cases the topics are all out of order.

I've processed the sample dataset with 5 topics so that the resulting table is easier to interpret, and calculated the embeddings separately so the topic clustering is quicker to rerun. Neither of these steps changes the behaviour I'm seeing.

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       nr_topics=5,
                       calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs, embeddings)

Resulting topics:

topic_model.get_topic_info()

[screenshot: output of topic_model.get_topic_info()]

I've joined the probabilities table to the document info table:

publication_info = topic_model.get_document_info(docs)
publication_info = pd.concat([publication_info, pd.DataFrame(probs)], axis=1)
publication_info.query('Topic != -1').groupby('Topic').head(5).sort_values('Topic')

I would expect the probability of the chosen topic to appear in the corresponding column of the probabilities table. It appears that it always does when the probability of the chosen topic is 1. If the probability is < 1, it may appear in another column; in the example below, that is usually column 0 (but not always).

[screenshot: probabilities table; some unnecessary columns cut]

I should note that due to the stochastic nature of the process, this behaviour is variable. It is not always column 0 where the probability ends up, and I have occasionally seen the probabilities in the correct columns, so you may need to run it a few times if you don't see it happening, but I would say the probabilities fail to line up about 4 times out of 5.
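For a quick numeric check rather than eyeballing screenshots, a sketch like this (reusing topics and probs from the snippet above) measures how often the argmax column agrees with the assigned topic:

import numpy as np

# Ignore outliers (-1); for the remaining documents, check whether the
# highest-probability column matches the assigned topic number.
topics_arr = np.array(topics)
mask = topics_arr != -1
agreement = (np.argmax(probs[mask], axis=1) == topics_arr[mask]).mean()
print(f"argmax column matches assigned topic for {agreement:.0%} of documents")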

bertopic 0.14.0

MaartenGr commented 1 year ago

In the docstrings of calculate_probabilities you can find the following information:

NOTE: This is an approximation of topic probabilities as used in HDBSCAN and not an exact representation.

This enables the soft clustering capabilities of HDBSCAN as described here. This soft clustering approach is a post-hoc approximation of the cluster membership probabilities, and as such the estimated probabilities can differ from the hard cluster assignments. You can read more about that in the link about soft clustering.
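If it helps to see that step in isolation, here is a minimal sketch of HDBSCAN's soft clustering on synthetic data (not BERTopic itself); the membership vectors are computed after the fact, separately from the hard labels:

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic data with clear clusters, just to illustrate the API.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# prediction_data=True is required for the soft clustering utilities.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Post-hoc approximation: one membership probability per cluster, per point.
soft = hdbscan.all_points_membership_vectors(clusterer)

# The argmax of a row need not equal the hard label in clusterer.labels_.
labels = clusterer.labels_
mask = labels != -1
disagreements = int((np.argmax(soft[mask], axis=1) != labels[mask]).sum())
print(f"{disagreements} non-outlier points disagree with their hard label")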

zilch42 commented 1 year ago

Thanks, I've had a read of that page. That makes sense, but I don't think it answers my query. That documentation states that "The probability value at the ith entry of the vector is the probability that that point is a member of the ith cluster". My point is that this isn't holding true.

If we take the first entry of topic 3 in the table above (index 2942) as an example, it has been assigned to topic number 3 with a probability of 0.306. Therefore that probability value (approximately) should appear in column 3 of the probabilities matrix, but it instead appears in column 0 and the probability in column 3 is much lower.
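Concretely, for that document (values from my run shown in the comments):

print(topics[2942])              # 3 (the assigned topic)
print(round(probs[2942, 0], 3))  # ~0.306 -- the assigned topic's probability
print(round(probs[2942, 3], 3))  # much lower, even though this is column 3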

Based on the highest probability in the probabilities matrix, it should have been assigned to topic 0, not 3. (I'm not here saying the topic assignment is incorrect - the topic assignments look amazing - only that the probability vector ordering is mixed up)

I'm assuming here that each document (that isn't an outlier) is assigned to the topic that it has the highest probability of being a member of, and that the columns of the probability matrix align to the topic numbers. If either of those aren't true that may be where I'm going wrong.

MaartenGr commented 1 year ago

What I meant with my message is that the soft clustering is merely an approximation, so the highest probability may not match a given cluster since it is a post-hoc approximation of the cluster-document probabilities. As such, it is not surprising that the index of the highest probability does not match the assigned topic: the approximated probabilities are not inherent to the cluster assignment.

zilch42 commented 1 year ago

Ah, I see what you're saying. I did misunderstand your initial comment. So HDBSCAN does a hard clustering step to determine the topics and then a soft clustering step to determine the probability of membership, and the two processes are independent? That makes more sense.

Still, the results I'm seeing are surprising. Using the dataset I mentioned in #1006 with five very distinct topics, hdbscan clusters them near perfectly, but then the probability matrix suggests that all of my documents about cows have a 70-100% chance of belonging to the cluster about immunology; and all of my documents about immunology have a 70-100% chance of belonging to the cluster about rocks. This seems systematic, not simply fuzzy as I might expect if the soft clustering is not an exact representation.
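To show it's systematic rather than noisy, a sketch like this (again reusing topics and probs) reports, for each assigned topic, which probability column receives the highest average mass; in my runs the pattern is a consistent permutation:

import numpy as np

topics_arr = np.array(topics)
for t in sorted(set(topics_arr) - {-1}):
    # Average the probability rows of all documents assigned to topic t.
    mean_probs = probs[topics_arr == t].mean(axis=0)
    print(f"topic {t}: highest average mass in column {mean_probs.argmax()}")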

I have been looking through the code and I think the issue lies in the _sort_mappings_by_frequency step. If I just remove the following lines from .fit_transform() in _bertopic.py, then the probability matrix looks far more reasonable.

# Sort and Map Topic IDs by their frequency
if not self.nr_topics:
    documents = self._sort_mappings_by_frequency(documents)

My documents about cows then have a high probability of belonging to the cows cluster; my rocks documents have a high probability of belonging to the rocks cluster, etc.

My guess is that when _sort_mappings_by_frequency() is run after the topics and probabilities have been computed, the probabilities matrix never gets told about the new topic order.
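For illustration only (this isn't BERTopic's actual code), the kind of column reordering I would expect to accompany the topic remapping looks something like:

import numpy as np

def remap_probability_columns(probs, mapping):
    """Apply a topic remapping {old_id: new_id} to the probability columns.

    Purely a sketch; the outlier topic -1 has no column and is skipped.
    """
    remapped = np.zeros_like(probs)
    for old, new in mapping.items():
        if old != -1 and new != -1:
            remapped[:, new] = probs[:, old]
    return remapped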

MaartenGr commented 1 year ago

Thanks for figuring out where this all might be going wrong. Based on what you're seeing, it would indeed seem that something systematic is at fault here. I would be surprised if it were a result of the ._sort_mappings_by_frequency step, as later on the probabilities are mapped using:

https://github.com/MaartenGr/BERTopic/blob/1ee8141d65063a37f6ee3fd56b30e3f9e2f43d6e/bertopic/_bertopic.py#L376

My guess is that the mapping at that step is not working correctly, but as far as I know it has always worked, and I am not sure what changed between the last few versions that would explain this.

Is the dataset that you mentioned by chance publicly available? If so, would you mind creating a reproducible example of what is happening here? If I can re-create the issue, then perhaps fixing it becomes much easier.

zilch42 commented 1 year ago

Certainly! It's not a public dataset, but I'm happy to share it here: five_topics_dataset.csv

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Load data 
publications = pd.read_csv("five_topics_dataset.csv")
sentences = publications.Sentences.values

# Calculate embeddings so refitting the model can be done quicker 
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(sentences, show_progress_bar=True)

# umap min_dist 0.1 helps five topics naturally cluster together without reducing topics 
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.1, metric='cosine')

vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

# Fit Model
topic_model = BERTopic(
    embedding_model=embedding_model, 
    umap_model=umap_model,
    vectorizer_model=vectorizer_model,
    min_topic_size=100, 
    calculate_probabilities=True)
topics, probs = topic_model.fit_transform(sentences, embeddings)

# add topics and probabilities to publications df
publication_info = topic_model.get_document_info(sentences, df=publications)
publication_info = publication_info.drop(columns=['Sentences', 'DOI', 'Top_n_words'])
publication_info = pd.concat([publication_info, pd.DataFrame(probs)], axis=1)

# look at examples for each topic 
publication_info.sort_values(['Topic', 'Title']).groupby('Topic').head(5)

Thanks for taking a look at this

MaartenGr commented 1 year ago

It took a while but I think I know what is happening.

To start off, you are using a nice trick for reducing outliers by specifying the min_dist parameter in UMAP. This parameter can be quite tricky to get right, especially combined with min_topic_size, so it is not often that I see users doing this. In your case, you managed to get no outliers at all and still a nice separation of the clusters. What this means is that you are essentially using an HDBSCAN model that does not produce any outliers at all. This is quite rare.
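As a quick sanity check on your side, counting the outlier assignments (reusing topics from your snippet) should confirm this:

import numpy as np

# With your UMAP settings, this should print 0.
print("outlier documents:", int((np.array(topics) == -1).sum()))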

As it so happens, when the topics are sorted by their frequency, the 2D probabilities get mapped to those topics as well. At least, that is what should be happening. Typically, these 2D probabilities are only generated with HDBSCAN, which almost always produces outliers to some extent. This assumption can be found in the following lines:

https://github.com/MaartenGr/BERTopic/blob/1ee8141d65063a37f6ee3fd56b30e3f9e2f43d6e/bertopic/_bertopic.py#L3336-L3338

I think, but I am not sure yet, that simply removing the and self.get_topic(-1) condition and replacing the - 1 with - self.outliers here should do the trick and fix your probabilities being mapped incorrectly. I would have to test this out a bit further, but this currently seems to be the culprit.
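Paraphrased as a before/after sketch (not the exact diff; n_outlier_classes is just a stand-in name):

# Before (paraphrased): implicitly assumes an outlier topic always exists
if len(probabilities.shape) == 2 and self.get_topic(-1):
    mapped_probabilities = np.zeros((probabilities.shape[0],
                                     len(set(mappings.values())) - 1))

# After (paraphrased): only subtract a column when an outlier class exists
n_outlier_classes = 1 if self.get_topic(-1) else 0
if len(probabilities.shape) == 2:
    mapped_probabilities = np.zeros((probabilities.shape[0],
                                     len(set(mappings.values())) - n_outlier_classes))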

zilch42 commented 1 year ago

Ah, interesting. Thanks for tracking it down. And sorry, I thought I had checked that my UMAP settings weren't the culprit, but I must still not have been generating any outliers. Apologies if I led you a little astray there.

dkirman-re commented 1 year ago

I've been following this thread; very interesting edge case. Just for clarification, I believe the -1 should be replaced with - self._outliers, unless there's another local variable that I'm not privy to?

MaartenGr commented 1 year ago

Yes, I believe it should be removing and self.get_topic(-1) and replacing the - 1 in len(set(mappings.values())) - 1 with - self._outliers, but there might be a few more things that need to be replaced. These, however, seemed the most obvious.