MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

AttributeError: 'SentenceTransformerBackend' object has no attribute 'encode' #1169

Closed zhimin-z closed 1 year ago

zhimin-z commented 1 year ago

I tried to load the fitted topic model with the following code:

from bertopic import BERTopic
from matplotlib import pyplot as plt
from wordcloud import WordCloud

import os
import pandas as pd

path_result = 'Result'
path_dataset = 'Dataset'

path_general = os.path.join(path_result, 'General')
path_challenge = os.path.join(path_result, 'Challenge')
path_solution = os.path.join(path_result, 'Solution')

path_model_challenge = os.path.join(path_challenge, 'Model2')
path_model_solution = os.path.join(path_solution, 'Model2')

df = pd.read_json(os.path.join(path_dataset, 'preprocessed.json'))

# output the best topic model on challenges

model_challenge = 'Challenge_gpt_summary_btjbh6oz'
column_challenge = '_'.join(model_challenge.split('_')[:-1])

df['Challenge_topic'] = -1

indice_challenge = []
docs_challenge = []

for index, row in df.iterrows():
    if pd.notna(row[column_challenge]):
        indice_challenge.append(index)
        docs_challenge.append(row[column_challenge])

topic_model = BERTopic.load(os.path.join(
    path_model_challenge, model_challenge))
topics, probs = topic_model.transform(docs_challenge)

df_topics = topic_model.get_topic_info()
df_topics.to_json(os.path.join(
    path_challenge, 'Topic information.json'), indent=4, orient='records')

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_challenge, 'Topic visualization.html'))

fig = topic_model.visualize_barchart(
    top_n_topics=df_topics.shape[0]-1, n_words=10)
fig.write_html(os.path.join(path_challenge, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(
    path_challenge, 'Topic similarity visualization.html'))

fig = topic_model.visualize_term_rank()
fig.write_html(os.path.join(
    path_challenge, 'Term score decline visualization.html'))

hierarchical_topics = topic_model.hierarchical_topics(docs_challenge)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html(os.path.join(
    path_challenge, 'Hierarchical clustering visualization.html'))

embeddings = topic_model.embedding_model.encode(
    docs_challenge, show_progress_bar=False)
fig = topic_model.visualize_documents(docs_challenge, embeddings=embeddings)
fig.write_html(os.path.join(path_challenge, 'Document visualization.html'))

# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
new_topics_challenge = topic_model.reduce_outliers(
    docs_challenge, topics, probabilities=probs, strategy="probabilities")

I can guarantee that I set both calculate_probabilities and prediction_data to True, but it still gives the following error when I attempt to visualize the embeddings. What should I do? @MaartenGr I appreciate it a lot in advance!

AttributeError: 'SentenceTransformerBackend' object has no attribute 'encode'

MaartenGr commented 1 year ago

In BERTopic, any embedding model that you pass as a parameter is converted to a bertopic.backend.BaseEmbedder class. So using topic_model.embedding_model.encode will not work as the encode function is specific to sentence-transformers. Instead, using topic_model.embedding_model.embed should work.
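Conceptually, the wrapping described above can be sketched as the stand-in below. This is an illustrative simplification, not the library source; the class name mirrors bertopic.backend, but the body only models the behaviour: the raw model is held as an attribute rather than inherited from, so the wrapper exposes embed but not encode.

```python
# Illustrative stand-in for BERTopic's backend wrapper (names mirror
# bertopic.backend; this is NOT the actual library code, just a sketch
# of the behaviour described above).
class SentenceTransformerBackend:
    """Wraps a raw sentence-transformers model behind a unified .embed API."""

    def __init__(self, embedding_model):
        # The original model is kept as an attribute, not inherited from,
        # so .encode is no longer exposed on the wrapper itself.
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        # Delegates to the wrapped model's encode()
        return self.embedding_model.encode(documents, show_progress_bar=verbose)
```

So in the script above, replacing topic_model.embedding_model.encode(docs_challenge, show_progress_bar=False) with topic_model.embedding_model.embed(docs_challenge) should resolve the AttributeError.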

Bougeant commented 7 months ago

IMO, replacing the embedding_model with a backend object when the fit or fit_transform method is called leads to strange behaviour. For example,

embeddings_train = bertopic.embedding_model.encode(docs_train)
bertopic.fit(docs_train, embeddings_train)
embeddings_test = bertopic.embedding_model.encode(docs_test)  # breaks: embedding_model has silently been replaced by a different object

So we need to use this instead (which is a bit weird because one needs to use different functions to generate embeddings depending on whether this occurs before or after the model has been fit):

embeddings_train = bertopic.embedding_model.encode(docs_train)
bertopic.fit(docs_train, embeddings_train)
embeddings_test = bertopic._extract_embeddings(docs_test) 

Is there another recommended approach?

MaartenGr commented 7 months ago

@Bougeant Thanks for sharing this. Perhaps not intuitive, but the embedding_model was not meant to be used outside of BERTopic but merely within the model itself to generate, for example, word embeddings when using KeyBERTInspired or MaximalMarginalRelevance. The backend object was developed to create a unified approach for extracting embeddings within BERTopic.

Generally, I would advise using the embedding model outside of BERTopic to generate document embeddings as those are also typically saved outside of the model.
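A minimal sketch of that recommended pattern, keeping your own reference to the raw sentence-transformers model so the same encode call works before and after fitting. The model name is a placeholder, and docs_train/docs_test are the document lists from the example above:

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Keep your own reference to the raw model instead of reaching into
# topic_model.embedding_model after fitting.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
topic_model = BERTopic(embedding_model=sentence_model)

# Precompute embeddings once and pass them in explicitly.
embeddings_train = sentence_model.encode(docs_train)
topic_model.fit(docs_train, embeddings=embeddings_train)

# The same call works after fitting, since sentence_model was never replaced.
embeddings_test = sentence_model.encode(docs_test)
topics, probs = topic_model.transform(docs_test, embeddings=embeddings_test)
```

Precomputing embeddings this way also avoids re-encoding the corpus if the model is refit with different clustering parameters.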