Closed: zhimin-z closed this issue 1 year ago
In BERTopic, any embedding model that you pass as a parameter is converted to a `bertopic.backend.BaseEmbedder` class. So using `topic_model.embedding_model.encode` will not work, as the `encode` function is specific to sentence-transformers. Instead, `topic_model.embedding_model.embed` should work.
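The wrapping behaviour described above can be illustrated with a minimal sketch. The two stub classes below are hypothetical stand-ins, not the real `sentence_transformers` or `bertopic.backend` classes; they only mimic the interface change (`.encode` on the raw model, `.embed` on the wrapped backend):

```python
class SentenceTransformerStub:
    """Stand-in for a sentence-transformers model: exposes .encode."""
    def encode(self, docs):
        # Real models return dense vectors; a length-based stand-in suffices here.
        return [[float(len(d))] for d in docs]

class BaseEmbedderStub:
    """Stand-in for bertopic.backend.BaseEmbedder: exposes .embed."""
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def embed(self, docs):
        # The backend delegates to the underlying model internally.
        return self.embedding_model.encode(docs)

model = SentenceTransformerStub()
wrapped = BaseEmbedderStub(model)  # roughly what BERTopic does internally

embeddings = wrapped.embed(["a doc"])   # works: the unified backend API
has_encode = hasattr(wrapped, "encode") # False: .encode is gone after wrapping
```

This is why code that reaches into `topic_model.embedding_model` after fitting finds `.embed` rather than `.encode`.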
IMO, replacing the `embedding_model` with a backend object when the `fit` or `fit_transform` method is called leads to strange behaviour. For example:

```python
embeddings_train = bertopic.embedding_model.encode(docs_train)
bertopic.fit(docs_train, embeddings_train)
embeddings_test = bertopic.embedding_model.encode(docs_test)  # breaks: embedding_model has secretly been replaced by a different object
```
So we need to use this instead (which is a bit weird, because one needs to use different functions to generate embeddings depending on whether this occurs before or after the model has been fit):

```python
embeddings_train = bertopic.embedding_model.encode(docs_train)
bertopic.fit(docs_train, embeddings_train)
embeddings_test = bertopic._extract_embeddings(docs_test)
```
Is there another recommended approach?
@Bougeant Thanks for sharing this. Perhaps not intuitive, but the `embedding_model` was not meant to be used outside of BERTopic but merely within the model itself to generate, for example, word embeddings when using KeyBERTInspired or MaximalMarginalRelevance. The backend object was developed to create a unified approach for extracting embeddings within BERTopic.
Generally, I would advise using the embedding model outside of BERTopic to generate document embeddings as those are also typically saved outside of the model.
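That recommended pattern — keep your own reference to the embedding model instead of reaching into `topic_model.embedding_model` — can be sketched as follows. The classes here are hypothetical stand-ins for `SentenceTransformer` and `BERTopic`, used only to show that the external reference keeps its `.encode` method before and after fitting:

```python
class EmbeddingModel:
    """Stand-in for a sentence-transformers model."""
    def encode(self, docs):
        return [[float(len(d))] for d in docs]

class TopicModelStub:
    """Mimics BERTopic replacing embedding_model with a backend on fit."""
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def fit(self, docs, embeddings):
        class Backend:  # stand-in for bertopic.backend.BaseEmbedder
            def __init__(self, m):
                self._m = m
            def embed(self, docs):
                return self._m.encode(docs)
        self.embedding_model = Backend(self.embedding_model)

embedding_model = EmbeddingModel()            # your own reference, kept outside
topic_model = TopicModelStub(embedding_model)

embeddings_train = embedding_model.encode(["doc one"])
topic_model.fit(["doc one"], embeddings_train)

# After fitting, the external reference still exposes the same .encode,
# so train and test embeddings are generated the same way.
embeddings_test = embedding_model.encode(["doc two"])
```

This also matches the point about persistence: the document embeddings are typically computed and saved outside the model, so the same external model naturally serves both the fit and transform steps.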
I tried to load the fitted topic model with the following command:
and I can guarantee that I have set `calculate_probabilities` and `prediction_data` to `True`, but it still gives the following error when I attempt to visualize the embedding. What should I do? @MaartenGr I appreciate it a lot in advance!