MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

(Zero-shot Topic Modeling) TypeError: object of type 'numpy.float64' has no len() #2034

Open Paignn opened 3 weeks ago

Paignn commented 3 weeks ago

Hello! I'm working on a project that uses BERTopic for zero-shot topic modeling. Unfortunately, I get an error when I try to fit the model.

Here is my model setup:

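from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

embedding_model_en = SentenceTransformer("all-MiniLM-L6-v2")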
embeddings_en = embedding_model_en.encode(df_comment_1['text_en'], show_progress_bar=True)
umap_model_en = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model_en = HDBSCAN(min_cluster_size=40, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model_en = CountVectorizer(min_df=2, ngram_range=(1, 2))
zeroshot_topic_list = ["good", "bad"]
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model_en = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}
topic_model_en = BERTopic(
    embedding_model=embedding_model_en,
    umap_model=umap_model_en,
    hdbscan_model=hdbscan_model_en,
    vectorizer_model=vectorizer_model_en,
    representation_model=representation_model_en,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.5,
    verbose=True,
    nr_topics=50
)

And when I run: topics_en, probs_en = topic_model_en.fit_transform(df_comment_1['text_en'], embeddings_en)

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-2af852ba034a> in <cell line: 13>()
     11 )
     12 
---> 13 topics_en, probs_en = topic_model_en.fit_transform(df_comment_1['text_en'], embeddings_en)
     14 topic_model_en.save('my_model_en_22', serialization="safetensors")
     15 topic_model_en.get_topic_info()

7 frames
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
    446         # Combine Zero-shot with outliers
    447         if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448             predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    449 
    450         return predictions, self.probabilities_

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3619         empty_dimensionality_model = BaseDimensionalityReduction()
   3620         empty_cluster_model = BaseCluster()
-> 3621         zeroshot_model = BERTopic(
   3622                 n_gram_range=self.n_gram_range,
   3623                 low_memory=self.low_memory,

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, images, y)
    314         ```
    315         """
--> 316         self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    317         return self
    318 

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
    431         else:
    432             # Extract topics by calculating c-TF-IDF
--> 433             self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    434 
    435             # Reduce topics

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _extract_topics(self, documents, embeddings, mappings, verbose)
   3784             logger.info("Representation - Extracting topics from clusters using representation models.")
   3785         documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
-> 3786         self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3787         self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3788         self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)

/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in _c_tf_idf(self, documents_per_topic, fit, partial_fit)
   4006 
   4007         if fit:
-> 4008             self.ctfidf_model = self.ctfidf_model.fit(X, multiplier=multiplier)
   4009 
   4010         c_tf_idf = self.ctfidf_model.transform(X)

/usr/local/lib/python3.10/dist-packages/bertopic/vectorizers/_ctfidf.py in fit(self, X, multiplier)
     86                 idf = idf * multiplier
     87 
---> 88             self._idf_diag = sp.diags(idf, offsets=0,
     89                                       shape=(n_features, n_features),
     90                                       format='csr',

/usr/local/lib/python3.10/dist-packages/scipy/sparse/_construct.py in diags(diagonals, offsets, shape, format, dtype)
    146     if isscalarlike(offsets):
    147         # now check that there's actually only one diagonal
--> 148         if len(diagonals) == 0 or isscalarlike(diagonals[0]):
    149             diagonals = [np.atleast_1d(diagonals)]
    150         else:

TypeError: object of type 'numpy.float64' has no len()
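
The last frame fails because sp.diags calls len() on its first argument, and the idf value passed in at _ctfidf.py line 88 is a numpy.float64 scalar rather than an array. The sketch below uses NumPy only and shows one way such a scalar can arise: summing a matrix that has a single column and squeezing the result. Whether BERTopic computes idf this way is an assumption, and the matrix X is a made-up stand-in, not data from this issue.

```python
import numpy as np

# Hypothetical stand-in for a term-count matrix with a single column,
# i.e. a vocabulary of one term (an assumption, not taken from the traceback).
X = np.array([[3.0], [1.0]])

# Summing over rows gives shape (1,); squeezing drops that axis and leaves a 0-d array.
df = np.squeeze(X.sum(axis=0))

# Arithmetic and ufuncs on 0-d inputs return a NumPy scalar, not an array.
idf = np.log((X.sum() / df) + 1)
print(type(idf).__name__)

# len() is undefined for NumPy scalars, which matches the error in the traceback.
try:
    len(idf)
    message = ""
except TypeError as exc:
    message = str(exc)
print(message)

# np.atleast_1d turns the scalar back into a 1-d array that supports len().
safe_idf = np.atleast_1d(idf)
print(safe_idf.shape)
```

If this is what happens here, the vocabulary left after vectorizing the zero-shot documents may contain only one term, which would be worth checking before changing anything else.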

How can I fix this error? Thank you.

MaartenGr commented 3 weeks ago

Hmmm, not sure what is happening here. Which version of BERTopic are you using? Also, could you try again without using vectorizer_model_en and nr_topics?
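
A minimal sketch of both checks, in case it helps: it reports the installed package versions using only the standard library, and it builds the reduced set of constructor arguments with vectorizer_model and nr_topics left out. The helper names installed_version and without_keys are made up for this example, and the string values in full_kwargs are placeholders for the objects built in the original post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


def without_keys(kwargs: dict, drop: tuple) -> dict:
    """Copy kwargs, leaving out the named keys (hypothetical helper)."""
    return {key: value for key, value in kwargs.items() if key not in drop}


# Versions worth reporting alongside the traceback.
for name in ("bertopic", "scipy", "numpy", "scikit-learn"):
    print(name, installed_version(name))

# Placeholder strings stand in for the real model objects from the post.
full_kwargs = {
    "embedding_model": "embedding_model_en",
    "umap_model": "umap_model_en",
    "hdbscan_model": "hdbscan_model_en",
    "vectorizer_model": "vectorizer_model_en",
    "representation_model": "representation_model_en",
    "zeroshot_topic_list": ["good", "bad"],
    "zeroshot_min_similarity": 0.5,
    "verbose": True,
    "nr_topics": 50,
}

# Drop the two arguments under suspicion and keep everything else unchanged.
reduced_kwargs = without_keys(full_kwargs, drop=("vectorizer_model", "nr_topics"))
print(sorted(reduced_kwargs))

# With the real objects in place, the retry would be BERTopic(**reduced_kwargs).
```

If the error disappears with the reduced arguments, adding the two back one at a time should show which of them triggers it.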