MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

OpenAI Embedding #2045

Closed by ilanit1997 2 weeks ago

ilanit1997 commented 2 weeks ago

I computed text embeddings with an AzureOpenAI model and want to pass these pre-computed embeddings to a BERTopic object. However, fitting the model fails at runtime. Below is the code I used:

ctfidf_model, hdbscan_model, representation_model, vectorizer_model, umap_model = self.create_base_models()
print("finished init base models")
if optimize_flag:
    reduced_embeddings = umap_model.fit_transform(output_dict.get("embeddings"))
    umap_model = Dimensionality(reduced_embeddings)
    output_dict["clusters"] = hdbscan_model.fit(reduced_embeddings).labels_
    output_dict["hdbscan_probs"] = hdbscan_model.probabilities_
    hdbscan_model = BaseCluster()

print("finished creating base models")
self.embedding_model = OpenAIBackend(batch_size=100)
topic_model = BERTopic(
    embedding_model=self.embedding_model,
    ctfidf_model=ctfidf_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    vectorizer_model=vectorizer_model,
    nr_topics=self.config.get("nr_topics"),
    n_gram_range=self.config.get("n_gram_range"),
    min_topic_size=self.config.get("min_topic_size"),
    top_n_words=self.config.get("top_n_words"),
    seed_topic_list=self.config.get("seed_topic_list"),
    calculate_probabilities=self.config.get("calculate_probabilities", False),
    verbose=True
)

topic_model = topic_model.fit(output_dict.get("texts"),
                              output_dict.get("embeddings"),
                              y=output_dict.get("clusters"))
topics, probs = topic_model.transform(output_dict.get("texts"),
                                      output_dict.get("embeddings"))

At runtime, the call to topic_model.fit fails with the following traceback:

File "/data/home/ilanit.sobol/***/code_files/semantic_modeling/analysis/timeseries_analysis/utils/semantic_modeling_utils.py", line 615, in run_pipeline_on_multiple_datasets
  topic_model, topics = self.fit_transform(topic_model, output_dict, self.config.get("optimized"))
File "/data/home/ilanit.sobol/***/code_files/semantic_modeling/analysis/timeseries_analysis/utils/semantic_modeling_utils.py", line 560, in fit_transform
  topic_model = topic_model.fit(output_dict.get("texts"),
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/_bertopic.py", line 316, in fit
  self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/_bertopic.py", line 433, in fit_transform
  self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/_bertopic.py", line 3787, in _extract_topics
  self.topic_representations_ = self._extract_words_per_topic(words, documents)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/_bertopic.py", line 4087, in _extract_words_per_topic
  self.topic_aspects_[aspect] = aspect_model.extract_topics(self, documents, c_tf_idf, aspects)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/representation/_keybert.py", line 91, in extract_topics
  sim_matrix, words = self._extract_embeddings(topic_model, topics, representative_docs, repr_doc_indices)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/representation/_keybert.py", line 163, in _extract_embeddings
  repr_embeddings = topic_model._extract_embeddings(representative_docs, method="document", verbose=False)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/_bertopic.py", line 3410, in _extract_embeddings
  embeddings = self.embedding_model.embed_documents(documents, verbose=verbose)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/backend/_base.py", line 69, in embed_documents
  return self.embed(document, verbose)
File "/data/home/ilanit.sobol/anaconda3/envs/llms_env/lib/python3.9/site-packages/pyAudioAnalysis/../bertopic/backend/_openai.py", line 73, in embed
  response = self.client.embeddings.create(input=batch, **self.generator_kwargs)
AttributeError: 'str' object has no attribute 'embeddings'

Additionally, I previously ran the same code with pre-computed Sentence-BERT embeddings and embedding_model=SentenceTransformer(), and did not encounter this issue.

Could you please provide guidance on how to resolve this error?

Thank you.

ilanit1997 commented 2 weeks ago

I solved this by passing the same AzureOpenAI client to OpenAIBackend.
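
For readers who land here with the same traceback: the frames show that the KeyBERT-based representation model still calls embedding_model.embed_documents on the representative documents even though document embeddings were pre-computed, and the last frame calls self.client.embeddings.create(...) on a str. That suggests a string, rather than a client object, ended up in the backend's client slot. Below is a minimal, stdlib-only sketch of that failure mode. FakeEmbeddings, FakeClient and SketchBackend are hypothetical stand-ins written for illustration; they are not BERTopic or OpenAI classes, and only the client.embeddings.create call is taken from the traceback.

```python
from types import SimpleNamespace


class FakeEmbeddings:
    """Hypothetical stand-in for the `embeddings` resource of an OpenAI-style client."""

    def create(self, input, **kwargs):
        # Return one dummy 3-dimensional vector per input text.
        data = [SimpleNamespace(embedding=[0.0, 0.0, 0.0]) for _ in input]
        return SimpleNamespace(data=data)


class FakeClient:
    """Hypothetical stand-in for a client object such as AzureOpenAI(...)."""

    def __init__(self):
        self.embeddings = FakeEmbeddings()


class SketchBackend:
    """Simplified stand-in that mirrors only the call seen in the traceback."""

    def __init__(self, client, batch_size=100):
        self.client = client
        self.batch_size = batch_size

    def embed(self, documents):
        vectors = []
        for start in range(0, len(documents), self.batch_size):
            batch = documents[start:start + self.batch_size]
            # Same attribute access as in the traceback: client.embeddings.create(...)
            response = self.client.embeddings.create(input=batch)
            vectors.extend(item.embedding for item in response.data)
        return vectors


docs = ["first document", "second document"]

# A string in the client slot reproduces the reported AttributeError.
try:
    SketchBackend("text-embedding-ada-002").embed(docs)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# A client object in the same slot works.
vectors = SketchBackend(FakeClient()).embed(docs)
```

With the real libraries, the analogous fix is the one described in the comment above: create the AzureOpenAI client once and hand that same object to OpenAIBackend as its client, so that the backend has something with an embeddings attribute to call when the representation model asks for new embeddings.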