MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Different topic assignment on training data when using saved model #2140

Open tmtsmrsl opened 2 months ago

tmtsmrsl commented 2 months ago


Describe the bug

When I save a model with PyTorch serialization and then use the loaded model to transform the training data, the new topic assignments differ from the "old" topic assignments stored in the model.

Reproduction

import numpy as np
from bertopic import BERTopic

# abstracts: the training documents; embeddings: their precomputed embeddings
topic_model_new = BERTopic.load("model")

# Old topic assignments, as stored in the model during fitting
new_df = topic_model_new.get_document_info(abstracts)

# New topic assignments from re-running inference on the same data
topics, probs = topic_model_new.transform(abstracts, embeddings)

# Compare old vs. new assignments
(new_df['Topic'] == np.array(topics)).value_counts()

Topic
True     1168
False     157
Name: count, dtype: int64
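
For context, the model above was saved with PyTorch serialization. A minimal version of that save step looks roughly as follows; the embedding model and arguments are illustrative assumptions, not the exact ones used:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Assumed setup: abstracts are the training documents, embeddings their precomputed embeddings.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(abstracts, embeddings)

# PyTorch serialization drops the UMAP and HDBSCAN sub-models from the saved model.
topic_model.save("model", serialization="pytorch", save_ctfidf=True,
                 save_embedding_model=embedding_model)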

BERTopic Version

0.16.3

MaartenGr commented 2 months ago

Thank you for reaching out. This is expected behavior: when you save a model with PyTorch serialization, the underlying dimensionality reduction and clustering models are removed from it. To still allow inference, a different technique is used to assign documents to topics, namely cosine similarity between document embeddings and topic embeddings.
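
As a rough illustration of that fallback (a sketch of the idea, not BERTopic's exact internal code), the assignment amounts to picking the topic whose embedding is most similar to each document embedding:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def assign_by_similarity(doc_embeddings: np.ndarray, topic_embeddings: np.ndarray) -> np.ndarray:
    # doc_embeddings: (n_docs, dim), topic_embeddings: (n_topics, dim)
    # Returns, for each document, the index of the most similar topic.
    sims = cosine_similarity(doc_embeddings, topic_embeddings)
    return sims.argmax(axis=1)

Because this is a different decision rule than the HDBSCAN clustering used during training, some documents can end up in a different topic than before.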

Do note that something similar can happen even when you use pickle serialization, because HDBSCAN uses an approximation during inference, and its results are likely to differ from the labels it produced during training.
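
A minimal sketch of that effect, assuming a plain hdbscan.HDBSCAN model with prediction data enabled and toy blob data standing in for reduced document embeddings:

import hdbscan
from sklearn.datasets import make_blobs

# Toy data standing in for reduced document embeddings (assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=5, n_features=5, random_state=42)

# prediction_data=True is required for approximate inference later on.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Labels assigned during fitting ...
train_labels = clusterer.labels_

# ... versus labels from approximate inference on the very same points.
approx_labels, strengths = hdbscan.approximate_predict(clusterer, X)

print((train_labels == approx_labels).mean())  # can be below 1.0, e.g. near cluster boundaries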