[Closed] zilch42 closed this issue 1 year ago
In the docstring of calculate_probabilities you can find the following information:
NOTE: This is an approximation of topic probabilities as used in HDBSCAN and not an exact representation.
This enables the soft clustering capabilities of HDBSCAN as described here. This soft clustering approach can be viewed as a post-hoc approximation of the probabilities: it tries to estimate membership probabilities after the fact, which can therefore differ from the hard cluster assignments. You can read more about that in the link about soft clustering.
Thanks, I've had a read of that page. That makes sense, but I don't think it answers my query. That documentation states that "The probability value at the ith entry of the vector is the probability that that point is a member of the ith cluster". My point is that this does not hold true here.
If we take the first entry of topic 3 in the table above (index 2942) as an example, it has been assigned to topic number 3 with a probability of 0.306. Therefore that probability value (approximately) should appear in column 3 of the probabilities matrix, but it instead appears in column 0 and the probability in column 3 is much lower.
Based on the highest probability in the probabilities matrix, it should have been assigned to topic 0, not 3. (I'm not here saying the topic assignment is incorrect - the topic assignments look amazing - only that the probability vector ordering is mixed up)
I'm assuming here that each document (that isn't an outlier) is assigned to the topic that it has the highest probability of being a member of, and that the columns of the probability matrix align to the topic numbers. If either of those aren't true that may be where I'm going wrong.
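Those two assumptions can be made concrete with a toy matrix (the values here are invented purely for illustration):

```python
import numpy as np

# Hypothetical 3-document, 3-topic probability matrix: column j = topic j
probs = np.array([
    [0.80, 0.15, 0.05],   # doc 0 -> expected topic 0
    [0.10, 0.70, 0.20],   # doc 1 -> expected topic 1
    [0.05, 0.25, 0.70],   # doc 2 -> expected topic 2
])

# Assumption 1: each non-outlier document is assigned its argmax topic
assigned = probs.argmax(axis=1)
print(assigned.tolist())  # [0, 1, 2]

# Assumption 2: the assigned topic's probability sits in the matching column
print([probs[i, t] for i, t in enumerate(assigned)])  # [0.8, 0.7, 0.7]
```

The reported behaviour breaks assumption 2: the 0.306 value for document 2942 shows up in column 0 instead of column 3.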
What I meant with my message is that the soft clustering is merely an approximation, and it may happen that the highest probability does not match a given cluster, since it is a post-hoc approximation of the cluster-document probabilities. As such, it is not surprising that the index of the highest probability does not always match the assigned topic: the approximated probabilities are not inherent to the cluster assignment.
Ah, I see what you're saying. I did misunderstand your initial comment. So HDBSCAN does a hard clustering step to determine the topics and then a soft clustering step to determine the probability of membership, and the two processes are independent? That makes more sense.
Still, the results I'm seeing are surprising. Using the dataset I mentioned in #1006 with five very distinct topics, hdbscan clusters them near perfectly, but then the probability matrix suggests that all of my documents about cows have a 70-100% chance of belonging to the cluster about immunology; and all of my documents about immunology have a 70-100% chance of belonging to the cluster about rocks. This seems systematic, not simply fuzzy as I might expect if the soft clustering is not an exact representation.
I have been looking through the code and I think the issue lies in the _sort_mappings_by_frequency step. If I just remove the following lines from .fit_transform() in _bertopic.py, then the probability matrix looks far more reasonable:
# Sort and Map Topic IDs by their frequency
if not self.nr_topics:
documents = self._sort_mappings_by_frequency(documents)
My documents about cows then have a high probability of belonging to the cows cluster; my rocks documents have a high probability of belonging to the rocks cluster, etc.
My guess is that when _sort_mappings_by_frequency() is run after topics and probabilities have been computed, the probabilities matrix never gets told about the new topic order.
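If the columns really are not being remapped, the fix is conceptually a column permutation: whenever topic IDs are renumbered by frequency, the probability columns must be permuted with the same mapping. A minimal numpy sketch (the mapping dict and variable names here are hypothetical, not BERTopic's actual internals):

```python
import numpy as np

# Hypothetical frequency-sorted renumbering: old topic ID -> new topic ID
mapping = {0: 2, 1: 0, 2: 1}

probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
])

# Build the permuted matrix: new column `new` takes old column `old`
remapped = np.zeros_like(probs)
for old, new in mapping.items():
    remapped[:, new] = probs[:, old]

print(remapped[0].tolist())  # old column 0 (0.7) now sits in column 2
```

Skipping this step while still renumbering the topics would produce exactly the symptom described: correct assignments, but probability columns that belong to a different (pre-sort) topic order.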
Thanks for figuring out where this all might be going wrong. Based on what you're seeing, it would indeed seem that something systematic is at fault here. I would be surprised if it is a result of the _sort_mappings_by_frequency step, as later on the probabilities are mapped using:
My guess is that the mapping at that step is not working correctly but as far as I know, it has always worked and I am not sure what changed between the last few versions that would explain this.
Is the dataset that you mentioned by chance publicly available? If so, would you mind creating a reproducible example of what is happening here? If I can re-create the issue, then perhaps fixing it becomes much easier.
Certainly! It's not a public dataset, but I'm happy to share: five_topics_dataset.csv
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
# Load data
publications = pd.read_csv("five_topics_dataset.csv")
sentences = publications.Sentences.values
# Calculate embeddings so refitting the model can be done quicker
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(sentences, show_progress_bar=True)
# umap min_dist 0.1 helps five topics naturally cluster together without reducing topics
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.1, metric='cosine')
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
# Fit Model
topic_model = BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
vectorizer_model=vectorizer_model,
min_topic_size=100,
calculate_probabilities=True)
topics, probs = topic_model.fit_transform(sentences, embeddings)
# add topics and probabilities to publications df
publication_info = topic_model.get_document_info(sentences, df=publications)
publication_info = publication_info.drop(columns=['Sentences', 'DOI', 'Top_n_words'])
publication_info = pd.concat([publication_info, pd.DataFrame(probs)], axis=1)
# look at examples for each topic
publication_info.sort_values(['Topic', 'Title']).groupby('Topic').head(5)
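To quantify the misalignment from a run of the script above, one quick check is the fraction of non-outlier documents whose argmax column matches their assigned topic. A self-contained helper (the function name and the demo arrays are my own, shown here on synthetic values; in practice you would call it on the `topics` and `probs` returned by fit_transform):

```python
import numpy as np

def argmax_alignment(topics, probs):
    """Fraction of non-outlier documents whose highest-probability
    column equals their assigned topic ID."""
    topics = np.asarray(topics)
    mask = topics != -1                      # ignore outlier documents
    return float((probs[mask].argmax(axis=1) == topics[mask]).mean())

# Synthetic demo: 4 documents, one outlier, one misaligned row
topics_demo = [0, 1, -1, 1]
probs_demo = np.array([
    [0.9, 0.1],   # aligned
    [0.8, 0.2],   # misaligned: assigned topic 1, argmax column 0
    [0.5, 0.5],   # outlier, ignored
    [0.3, 0.7],   # aligned
])
print(argmax_alignment(topics_demo, probs_demo))  # 2 aligned of 3 -> ~0.667
```

On a healthy probability matrix this fraction should sit near 1.0; the systematic column shuffling described above would drive it far lower.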
Thanks for taking a look at this
It took a while but I think I know what is happening.
To start off, you are using a nice trick for reducing outliers by specifying the min_dist
parameter in UMAP. This parameter can be quite tricky to get right, especially combined with min_topic_size
, so it is not often that I see users using this. In your case, you managed to get no outliers at all and still a nice separation of the clusters. What this means is that you are essentially using an HDBSCAN model that does not produce any outliers at all. This is quite rare.
As it so happens, when the topics are sorted by their frequency, the 2D probabilities get mapped to those topics as well. At least, that is what should be happening. Typically, these 2D probabilities are only generated with HDBSCAN, which almost always produces outliers to some extent. This assumption can be found in the following lines:
I think, but I am not sure yet, that simply removing and self.get_topic(-1) and replacing - 1 with - self.outliers here should do the trick and fix your probabilities being mapped incorrectly. I would have to test this out a bit further, but this currently seems to be the culprit.
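The assumption being fixed can be sketched as follows (the function name is hypothetical; the thread suggests BERTopic tracks the outlier count as an attribute along the lines of `self._outliers`). The point is that the number of real probability columns should subtract the outlier class only when one actually exists:

```python
def n_prob_columns(topic_ids):
    """Number of columns the 2D probability matrix should have:
    one per real topic; topic -1 (outliers) gets no column."""
    outliers = 1 if -1 in topic_ids else 0   # mirrors an _outliers flag
    return len(set(topic_ids)) - outliers

print(n_prob_columns([-1, 0, 1, 2]))  # 3: outlier class present
print(n_prob_columns([0, 1, 2]))      # 3: no outliers; a hardcoded "- 1"
                                      #    would wrongly give 2 here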
Ah interesting. Thanks for tracking it down. And sorry, I thought I had checked to make sure my UMAP settings weren't the culprit but I must've still not been generating any outliers. Apologies if I led you a little astray in that.
I've been following this thread, very interesting edge case. Just for clarification, I believe the - 1 should be replaced with - self._outliers, unless there's another local variable that I'm not privy to?
Yes, I believe it should be removing and self.get_topic(-1) and replacing the - 1 in len(set(mappings.values())) - 1 with - self.outliers, but there might be a few more things that need to be replaced. These, however, seemed the most obvious.
Hi there,

I'm quite confused by the probabilities table produced using calculate_probabilities=True. I think in some cases the topics are all out of order. I've processed the sample dataset with 5 topics so that the resulting table is easier to interpret, and calculated the embeddings separately so the topic clustering is quicker to rerun. Neither of these steps changes the behaviour I'm seeing.
Resulting topics:
I've joined the probabilities table to the document info table:
I would expect the probability of the chosen topic to appear in the corresponding column of the probabilities table. It appears that it always does if the probability of the chosen topic = 1. If the probability is < 1, it may appear in another column. In the example below, that is usually column 0 (but not always).
(I have cut some unnecessary columns from the screenshot)
I should note that due to the stochastic nature of the process, this behaviour is variable. It is not always column 0 where the probability ends up, and I have occasionally seen the probabilities land in the correct columns, so you may need to run it a few times if you don't see it happening, but I would say they fail to line up about 4 times out of 5.
bertopic 0.14.0