MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Question about discrepancy in fit_transform and transform #1582

Closed joeltorby closed 1 year ago

joeltorby commented 1 year ago

I have trained a BERTopic model in the following way, given a vocabulary of keywords:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# `vocabulary` is the predefined list of keywords
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
sentence_model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
topic_model = BERTopic(embedding_model=sentence_model, vectorizer_model=vectorizer_model)
topics, probabilities = topic_model.fit_transform(docs)
topic_model.save("model_folder", serialization="pytorch", save_ctfidf=True)
```

I then load the topic model (on the same machine, with the same virtual environment):

```python
sentence_model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
loaded_model = BERTopic.load("model_folder", embedding_model=sentence_model)
```

When I inspect the `topics_` attribute of `loaded_model`, this is the distribution of the number of documents per topic (top 11 rows shown):

```
topics_ count
-1    1028
0     643
1     305
2     219
3     184
4     184
5     150
6     147
7     135
8     104
9     84
10    81
```

I now perform a topic prediction with `transform` on the `loaded_model`, using the exact same documents `docs` that were used for training:

```python
predictions, probabilities = loaded_model.transform(docs)
```

The distribution of predicted topics (top 11 rows shown):

```
predicted_topic count
0           545
1           271
-1          258
8           235
4           225
2           188
3           184
5           178
12          159
7           134
6           131
```

So my question is: why is there a difference between `fit_transform` and `transform`? Is this expected, or must the difference come from a coding error (I could not include all the code here)?
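Rather than eyeballing per-topic counts, the divergence can be quantified by comparing the two label vectors element-wise. A small numpy sketch; the arrays below are hypothetical stand-ins for `topic_model.topics_` and the output of `transform`, not data from this thread:

```python
import numpy as np

def agreement(fit_labels, pred_labels):
    """Fraction of documents assigned the same topic at fit time and predict time."""
    fit_labels = np.asarray(fit_labels)
    pred_labels = np.asarray(pred_labels)
    return float((fit_labels == pred_labels).mean())

# Stand-in label vectors: fit-time topics vs. transform on the same docs.
fit = np.array([0, 0, 1, -1, 2, 2, 1, 0])
pred = np.array([0, 0, 1, 0, 2, 2, 1, 8])
print(agreement(fit, pred))  # 0.75
```

Matching per-topic counts can still hide disagreement on individual documents, so an element-wise comparison like this is a stricter check than comparing the two distributions.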

MaartenGr commented 1 year ago

There are two reasons why this might happen, both of which are part of the underlying algorithm:

- During `fit_transform`, HDBSCAN clusters the reduced embeddings directly, whereas `transform` assigns topics through HDBSCAN's approximate prediction, which is not guaranteed to reproduce the original cluster assignments, especially for documents near cluster boundaries or outliers.
- When you save with `serialization="pytorch"`, the dimensionality reduction and clustering models are not saved. On the loaded model, `transform` instead assigns each document to the topic whose embedding is most similar by cosine similarity, which can differ from the original assignments.

In other words, it is not necessarily a bug or a feature. It is mainly part of the underlying clustering algorithm/task.
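The reduced, similarity-based assignment that a pytorch/safetensors-serialized model falls back to can be sketched in plain numpy. The toy vectors below are illustrative, not BERTopic's actual internals:

```python
import numpy as np

def assign_by_similarity(doc_embeddings, topic_embeddings):
    """Assign each document to the topic whose embedding is most cosine-similar."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    return (docs @ topics.T).argmax(axis=1)

# Toy 2-D embeddings: two topic directions, three documents.
topic_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]])
print(assign_by_similarity(doc_emb, topic_emb))  # [0 1 0]
```

Note that a nearest-topic rule like this never produces an outlier label, which is consistent with the much smaller `-1` count the loaded model reports after `transform`.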

joeltorby commented 1 year ago

Passing a KMeans clustering model to BERTopic:

```python
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=100)
topic_model = BERTopic(...,
                       hdbscan_model=cluster_model)
```

I now get more similar distributions, but still not identical. From `fit_transform` (top 11 rows):

```
topics_ count
0   138
1   129
2   126
3   118
4   95
5   88
6   87
7   86
8   76
9   75
10  74
```

Predictions from `transform` (top 11 rows):

```
predicted_topic count
0       126
2       124
1       120
3       104
4       92
6       89
5       80
7       80
16      74
8       72
9       69
```

And after switching to the pickle format via `topic_model.save("model_folder", serialization="pickle", save_ctfidf=True)`, I got identical results from `fit_transform` and `transform`.
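That outcome matches expectations: pickle serialization keeps the fitted clustering model, and KMeans predicts with the same nearest-centroid rule it was trained with, so fit-time labels and later predictions on the same data agree once training has converged. A minimal scikit-learn sketch; the synthetic data and parameters are illustrative, not from this thread:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated points as a stand-in for document embeddings.
X, _ = make_blobs(n_samples=500, centers=10, cluster_std=0.5, random_state=42)

km = KMeans(n_clusters=10, n_init=10, random_state=42)
fit_labels = km.fit_predict(X)   # labels assigned during fitting
pred_labels = km.predict(X)      # nearest-centroid prediction afterwards

# With a converged KMeans model, the two assignments coincide.
print((fit_labels == pred_labels).all())
```

By contrast, HDBSCAN's approximate prediction is not guaranteed to reproduce its fit-time cluster assignments, which is where the residual discrepancy came from.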

This is awesome! For my case, it is more important to have an unambiguous model than a smaller model with the ability to predict outliers. Thanks for the extremely fast response :)