MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Guided topic modelling np.average function not behaving as expected? #1799

Open GeorgeDeac opened 8 months ago

GeorgeDeac commented 8 months ago

Issue:

Fitting a model with seed words for guided topic modelling raises: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Description

A ValueError was encountered when attempting to fit a topic model using BERTopic with the following configuration:

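from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# doc is a corpus of about 3K posts (df is a pandas DataFrame loaded earlier)
doc = df[df['post_text_clean'] != '']['post_text_clean'].tolist()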

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = sentence_model.encode(doc, show_progress_bar=True)

seed_topic_list = [
    ['surgery', 'bottom', 'top', 'hormone'],
    ['accept', 'acceptance', 'strength', 'needs'],
    ['connectedness', 'support', 'activism', 'mentor'],
    ['stopped', 'cancelled', 'pass', 'confusion'],
    ['peers', 'family', 'friends', 'group'],
    ['anxiety', 'depression', 'dissociation', 'anorexia'],
    ['dysphoria', 'familial', 'stress', 'health'],
    ['impulsive', 'introverted', 'sensitivity', 'shame'],
    ['violence', 'rejection', 'victimization', 'affirmation'],
    ['ideation', 'attempt', 'risk', 'prevention']
]

vectorizer_model = CountVectorizer(stop_words = 'english')
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

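main_representation_model = KeyBERTInspired()
aspect_representation_model1 = PartOfSpeech("en_core_web_sm")
aspect_representation_model2 = [KeyBERTInspired(top_n_words=30),
                                MaximalMarginalRelevance(diversity=.5)]

# Assumed: the snippet passes representation_model below without defining it;
# this dict follows BERTopic's multi-aspect convention and the aspect names are illustrative.
representation_model = {
    "Main": main_representation_model,
    "Aspect1": aspect_representation_model1,
    "Aspect2": aspect_representation_model2,
}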

topic_model = BERTopic(embedding_model=sentence_model,
                       calculate_probabilities=True,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       representation_model=representation_model,
                       seed_topic_list=seed_topic_list)

topic, probs = topic_model.fit_transform(doc, embedding)

The error occurs when calling the fit_transform method on a BERTopic instance with a set of documents and their embeddings.

The internal call to np.average in _guided_topic_modeling looks like the cause.

That call is meant to update the document embeddings assigned to a seed topic with a weighted average (weights 3 and 1) of each document embedding and the corresponding seed topic embedding. However, it passes np.average a two-element list: embeddings[indices], a 2-D array with one row per matching document, and seed_topic_embeddings[seed_topic], a single 1-D vector. Because the two elements have different shapes, np.asanyarray cannot combine them into one regular array and raises the ValueError about an inhomogeneous shape.
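The shape mismatch can be reproduced with NumPy alone. The sketch below is not BERTopic's code: `blend_with_seed` is a hypothetical helper, the arrays are toy values, and it assumes the intended result is the broadcast weighted mean (3 * document + 1 * seed) / 4. On NumPy 1.24 and later, ragged lists are rejected with a ValueError; on older releases the same call may only emit a warning, so the snippet records what happened instead of assuming it.

```python
import warnings

import numpy as np


def blend_with_seed(doc_embeddings, seed_embedding, doc_weight=3.0, seed_weight=1.0):
    """Hypothetical helper: weighted mean of each document row and one seed vector.

    Broadcasting applies the 1-D seed vector to every row, so no ragged
    list is ever handed to NumPy.
    """
    doc_embeddings = np.asarray(doc_embeddings, dtype=float)
    seed_embedding = np.asarray(seed_embedding, dtype=float)
    total = doc_weight + seed_weight
    return (doc_weight * doc_embeddings + seed_weight * seed_embedding) / total


# Toy stand-ins: three 4-dimensional document embeddings and one seed embedding.
docs = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [0.0, 0.0, 0.0, 0.0]])
seed = np.array([4.0, 4.0, 4.0, 4.0])

# Same call pattern as in the traceback: a list holding a 2-D and a 1-D array.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # older NumPy only warns about ragged input
    try:
        np.average([docs, seed], weights=[3, 1])
        ragged_call_failed = False
    except ValueError:
        ragged_call_failed = True

# Shape-safe alternative under the assumption stated above.
blended = blend_with_seed(docs, seed)
print("ragged call failed:", ragged_call_failed)
print("blended shape:", blended.shape)
```

Writing the weights out explicitly keeps the operation independent of how a given NumPy release treats ragged input, at the cost of no longer going through np.average.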

Steps to Reproduce

  1. Install numpy 1.25.0.
  2. Initialize a BERTopic model with seed_topic_list set (guided topic modelling).
  3. Prepare a list of documents and their precomputed embeddings.
  4. Call fit_transform on the BERTopic model with the documents and embeddings.

Error Traceback


ValueError                                Traceback (most recent call last)
Cell In[7], line 104
    102 # Topic Model Fitting
    103 print("Topic model fitting..")
--> 104 topic, probs = topic_model.fit_transform(doc, embedding)
    106 # Save Model State Checkpoint
    107 print("Saving model embeddings checkpoint..")

File c:\Users\georg\anaconda3\Lib\site-packages\bertopic\_bertopic.py:399, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    397 # Guided Topic Modeling
    398 if self.seed_topic_list is not None and self.embedding_model is not None:
--> 399     y, embeddings = self._guided_topic_modeling(embeddings)
    401 # Zero-shot Topic Modeling
    402 if self._is_zeroshot():

File c:\Users\georg\anaconda3\Lib\site-packages\bertopic\_bertopic.py:3617, in BERTopic._guided_topic_modeling(self, embeddings)
   3615 for seed_topic in range(len(seed_topic_list)):
   3616     indices = [index for index, topic in enumerate(y) if topic == seed_topic]
-> 3617     embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
   3618 logger.info("Guided - Completed \u2713")

File c:\Users\georg\anaconda3\Lib\site-packages\numpy\lib\function_base.py:511, in average(a, axis, weights, returned, keepdims)
    398 @array_function_dispatch(_average_dispatcher)
    399 def average(a, axis=None, weights=None, returned=False, *,
    400             keepdims=np._NoValue):
    401     """
    402     Compute the weighted average along the specified axis.
    403 
   (...)
    509            [4.5]])
    510     """
--> 511     a = np.asanyarray(a)
    513     if keepdims is np._NoValue:
    514         # Don't pass on the keepdims argument if one wasn't given.
    515         keepdims_kw = {}

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

MaartenGr commented 8 months ago

Thanks for sharing such a detailed description of your issue. I believe this is a known issue, and the fix seems to be to downgrade numpy. Could you check the link I shared for specifics?
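Before downgrading, it may help to check whether the installed NumPy is one that rejects ragged input at all. The probe below is only a sketch: `numpy_rejects_ragged_lists` is a hypothetical helper, not part of BERTopic or NumPy, and it assumes the error in this thread comes from NumPy refusing to build an array from a list of differently shaped arrays, which is a reading of the traceback and is not confirmed here.

```python
import warnings

import numpy as np


def numpy_rejects_ragged_lists():
    """Hypothetical probe: True if this NumPy raises on a ragged list of arrays."""
    ragged = [np.zeros((2, 3)), np.zeros(3)]  # a 2-D and a 1-D array, as in the traceback
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # older releases warn instead of raising
        try:
            np.asanyarray(ragged)
        except ValueError:
            return True
    return False


print("NumPy version:", np.__version__)
print("rejects ragged lists:", numpy_rejects_ragged_lists())
```

If the probe prints True, the guided topic modelling call shown in the traceback would be expected to fail in that environment under the assumption above.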