There are two reasons why this might happen, both of which are part of the underlying algorithm:
In other words, this is not necessarily a bug or a feature; it is mainly a property of the underlying clustering algorithm/task.
Passing a KMeans model as the clustering algorithm to BERTopic:

from sklearn.cluster import KMeans
from bertopic import BERTopic

# KMeans, unlike HDBSCAN, assigns every document to a cluster (no -1 outlier topic)
cluster_model = KMeans(n_clusters=100)
topic_model = BERTopic(hdbscan_model=cluster_model)
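Refitting then proceeds as usual; a minimal sketch, assuming docs is the same training corpus as before:

topics, probs = topic_model.fit_transform(docs)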
I now get more similar distributions, but they are still not identical. From fit_transform (top 11 rows):
topics_ count
0 138
1 129
2 126
3 118
4 95
5 88
6 87
7 86
8 76
9 75
10 74
Predictions from transform (top 11 rows):
predicted_topic count
0 126
2 124
1 120
3 104
4 92
6 89
5 80
7 80
16 74
8 72
9 69
And after changing to the pickle serialization format with
topic_model.save("model_folder", serialization="pickle", save_ctfidf=True)
I got identical results from fit_transform and transform.
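As a quick check, the round trip can be verified directly; a minimal sketch, assuming docs is the same training corpus:

from bertopic import BERTopic

# Reload the pickled model and re-run inference on the training documents
loaded_model = BERTopic.load("model_folder")
predictions, probs = loaded_model.transform(docs)

# Pickle serialization keeps the fitted cluster model, so the predicted
# topics match the assignments produced by fit_transform
assert list(predictions) == list(topic_model.topics_)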
This is awesome! For my case, it is more important to have an unambiguous model than a smaller model with the ability to predict outliers. Thanks for the exceptionally fast response :)
I have trained a BERTopic model in the following way, given a vocabulary of keywords:
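The full training code was not included in this issue; as a rough sketch, one common way to train with a fixed keyword vocabulary is to pass a CountVectorizer built from that vocabulary (keywords and docs below are assumed placeholders, not the original code):

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reconstruction: keywords and docs are placeholder names
vectorizer_model = CountVectorizer(vocabulary=keywords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
topic_model.save("model_folder")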
I then read the topic model (on the same machine, with the same virtual environment):
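A minimal sketch of the loading step:

from bertopic import BERTopic

# Load the saved model (same machine, same virtual environment)
loaded_model = BERTopic.load("model_folder")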
When I extract the topics_ attribute from the loaded_model, this is the distribution of the number of documents per topic (here the top 11 rows are shown):
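Such a per-topic count can be computed, for example, as follows (a sketch, assuming pandas is installed):

import pandas as pd

# Number of documents assigned to each topic during fitting (top 11 rows)
print(pd.Series(loaded_model.topics_).value_counts().head(11))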
I now perform a topic prediction using the transform command with the loaded_model on the exact same documents docs as were used for training:

predictions, probs = loaded_model.transform(docs)

The resulting distribution of documents per predicted topic differs (here the top 11 rows are shown).
So my question is: why is there a difference between fit_transform and transform? Is this expected, or must the difference come from a coding error? (I could not include all of the code here.)