MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Different topic assignment on training data when using saved model #2140

Open tmtsmrsl opened 2 months ago

tmtsmrsl commented 2 months ago


Describe the bug

When I save a model with PyTorch serialization and then use the loaded model to transform the training data, the new topic assignments differ from the "old" topic assignments stored in the model.

Reproduction

import numpy as np
from bertopic import BERTopic

# abstracts: the training documents; embeddings: their precomputed embeddings
topic_model_new = BERTopic.load("model")

# Old topic assignments, as stored in the model during fitting
new_df = topic_model_new.get_document_info(abstracts)

# New topic assignments from re-running inference on the same data
topics, probs = topic_model_new.transform(abstracts, embeddings)

# Compare old vs. new assignments
(new_df['Topic'] == np.array(topics)).value_counts()

Topic
True     1168
False     157
Name: count, dtype: int64
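
For context, the model above was saved with PyTorch serialization. A minimal version of that save step looks roughly as follows; the embedding model and arguments are illustrative assumptions, not the exact ones used:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Assumed setup: abstracts are the training documents, embeddings their precomputed embeddings.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(abstracts, embeddings)

# PyTorch serialization drops the UMAP and HDBSCAN sub-models from the saved model.
topic_model.save("model", serialization="pytorch", save_ctfidf=True,
                 save_embedding_model=embedding_model)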

BERTopic Version

0.16.3

MaartenGr commented 2 months ago

Thank you for reaching out. This is expected behavior: when you save a model with PyTorch serialization, the underlying dimensionality reduction and clustering models are removed from it. To still allow inference, a different technique is used to assign documents to topics, namely cosine similarity between document embeddings and topic embeddings.
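
As a rough illustration of that fallback (a sketch of the idea, not BERTopic's exact internal code), the assignment amounts to picking the topic whose embedding is most similar to each document embedding:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def assign_by_similarity(doc_embeddings: np.ndarray, topic_embeddings: np.ndarray) -> np.ndarray:
    # doc_embeddings: (n_docs, dim), topic_embeddings: (n_topics, dim)
    # Returns, for each document, the index of the most similar topic.
    sims = cosine_similarity(doc_embeddings, topic_embeddings)
    return sims.argmax(axis=1)

Because this is a different decision rule than the HDBSCAN clustering used during training, some documents can end up in a different topic than before.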

Do note that something similar can happen even when you use pickle serialization, because HDBSCAN uses an approximation during inference, and its results are likely to differ from the labels it produced during training.
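
A minimal sketch of that effect, assuming a plain hdbscan.HDBSCAN model with prediction data enabled and toy blob data standing in for reduced document embeddings:

import hdbscan
from sklearn.datasets import make_blobs

# Toy data standing in for reduced document embeddings (assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=5, n_features=5, random_state=42)

# prediction_data=True is required for approximate inference later on.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Labels assigned during fitting ...
train_labels = clusterer.labels_

# ... versus labels from approximate inference on the very same points.
approx_labels, strengths = hdbscan.approximate_predict(clusterer, X)

print((train_labels == approx_labels).mean())  # can be below 1.0, e.g. near cluster boundaries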