MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

OpenAI Representation: KeyError: 'content' #1570

Open clstaudt opened 8 months ago

clstaudt commented 8 months ago

I am getting a KeyError when running the topic model on a large dataset (400k documents), but not on smaller samples.

KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 topic_labeller.fit(
      2     data_train["content_truncated"]
      3 )

File /mnt/batch/tasks/shared/LS_root/mounts/clusters/staudtc-dsh-search-n2/code/Users/Christian.Staudt.external/experimentation-search/cognitive-search/topic-labelling/cognitive_search_topic_labelling/model.py:222, in TopicLabeller.fit(self, docs, y)
    219 if self.pre_trained_embeddings:
    220     fit_args.update(dict(embeddings=self.pre_trained_embeddings))
--> 222 self.topic_model.fit(**fit_args)
    224 return self

File /anaconda/envs/topic-labelling/lib/python3.10/site-packages/bertopic/_bertopic.py:303, in BERTopic.fit(self, documents, embeddings, images, y)
    262 def fit(self,
    263         documents: List[str],
    264         embeddings: np.ndarray = None,
    265         images: List[str] = None,
    266         y: Union[List[int], np.ndarray] = None):
    267     """ Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics
    268 
    269     Arguments:
   (...)
    301     ```
    302     """
--> 303     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    304     return self

File /anaconda/envs/topic-labelling/lib/python3.10/site-packages/bertopic/_bertopic.py:411, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    408     self._save_representative_docs(custom_documents)
    409 else:
    410      # Extract topics by calculating c-TF-IDF
--> 411     self._extract_topics(documents, embeddings=embeddings)
    413     # Reduce topics
    414     if self.nr_topics:

File /anaconda/envs/topic-labelling/lib/python3.10/site-packages/bertopic/_bertopic.py:3296, in BERTopic._extract_topics(self, documents, embeddings, mappings)
   3294 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   3295 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3296 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3297 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3298 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
   3299                       for key, values in
   3300                       self.topic_representations_.items()}

File /anaconda/envs/topic-labelling/lib/python3.10/site-packages/bertopic/_bertopic.py:3586, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
   3584                 self.topic_aspects_[aspect] = aspects
   3585             elif isinstance(aspect_model, BaseRepresentation):
-> 3586                 self.topic_aspects_[aspect] = aspect_model.extract_topics(self, documents, c_tf_idf, aspects)
   3588 return topics

File /anaconda/envs/topic-labelling/lib/python3.10/site-packages/bertopic/representation/_openai.py:191, in OpenAI.extract_topics(self, topic_model, documents, c_tf_idf, topics)
    189     else:
    190         response = openai.ChatCompletion.create(**kwargs)
--> 191     label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
    192 else:
    193     if self.exponential_backoff:

KeyError: 'content'

This is the representation model I am using:

bertopic.representation.OpenAI(
    model="gpt-35-turbo",
    chat=True,
    exponential_backoff=False,
    # delay_in_seconds=1,
    generator_kwargs={"engine": "gpt-35-turbo", "temperature": 0.1},
    prompt="""
        Output a concise, English topic label for the following keywords. Output only the label, format example for a label: Lorem Ipsum
        Never make the label a list of keywords. Ensure short labels, single terms or short term combinations. Do not add a period at the end of the label. If you are unable to perform the task, output: None
        [KEYWORDS]
    """,
)
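For context, a representation model like this is presumably handed to BERTopic via its representation_model argument; the TopicLabeller wrapper from the traceback is not shown in this issue, so this is only a minimal sketch of the wiring:

```python
from bertopic import BERTopic
from bertopic.representation import OpenAI

# Sketch: the OpenAI representation above, assigned and passed to BERTopic;
# the TopicLabeller wrapper from the traceback presumably does the same.
representation_model = OpenAI(
    model="gpt-35-turbo",
    chat=True,
    generator_kwargs={"engine": "gpt-35-turbo", "temperature": 0.1},
)
topic_model = BERTopic(representation_model=representation_model)
topic_model.fit(docs)  # docs: the list of document strings
```
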
MaartenGr commented 8 months ago

I am not entirely sure, but it might be a result of calling the API too many times. Instead, it might be worthwhile to set exponential_backoff=True and also set delay_in_seconds to at least 1. Overloading the server can cause unexpected issues.
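A minimal sketch of that suggested configuration, reusing the parameters from the snippet above:

```python
from bertopic.representation import OpenAI

# Sketch of the suggested settings: retry with exponential backoff and wait
# at least one second between calls so the API is not overloaded.
representation_model = OpenAI(
    model="gpt-35-turbo",
    chat=True,
    exponential_backoff=True,
    delay_in_seconds=1,
    generator_kwargs={"engine": "gpt-35-turbo", "temperature": 0.1},
)
```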

clstaudt commented 8 months ago

@MaartenGr I found it difficult to reproduce; sometimes it just works with the same amount of data. Consider catching the exception and logging a more meaningful error.

MaartenGr commented 8 months ago

Perhaps. This is the first time I have heard of this issue, so I would first like to know what the underlying cause is before catching exceptions. If I can't reproduce it, I might add some sort of exception handling. However, since there isn't actually an informative error being created, I can't really say more than "Something went wrong but I do not know what".

If you run it without OpenAI, how many topics are created? It feels as if something is going wrong with either the number of calls being made, or the model simply does not know how to respond and therefore doesn't.
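A quick way to check that count without the OpenAI step (a sketch using the default representation on the same documents, here called docs):

```python
from bertopic import BERTopic

# Sketch: fit without any OpenAI representation model and count the topics.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().shape[0])  # number of topics, incl. the -1 outlier topic
```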

This issue seems to be related, but I am not sure how to explain the empty topics.

This issue seems to refer to a "content filter", so perhaps the content of the prompt and your documents is against their regulations.
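If the content filter is indeed the cause, the raw response should show it. A quick way to check (a sketch, assuming the pre-1.0 openai.ChatCompletion interface from the traceback and that a filtered choice carries a "content_filter" finish reason):

```python
import openai

# Sketch: send one prompt directly and inspect the raw response to see whether
# the content filter fired instead of returning a normal message.
response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",
    messages=[{"role": "user", "content": "Output a concise, English topic label for: keyword1, keyword2"}],
    temperature=0.1,
)
choice = response["choices"][0]
if choice.get("finish_reason") == "content_filter" or "content" not in choice.get("message", {}):
    print("Filtered or empty response:", choice)
else:
    print("Label:", choice["message"]["content"])
```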

clstaudt commented 8 months ago

If you run it without OpenAI, how many topics are then created?

Around 50 topics. If I keep the number of topics the same but increase the input data size, the error is more likely to occur.

This issue seems to refer to a "content filter", so perhaps the content of the prompt and your documents is against their regulations.

Very unlikely, but there can be false positives.

MaartenGr commented 8 months ago

Around 50 topics. If I keep the number of topics the same but increase the input data size, the error is more likely to occur.

That actually points toward the "content_filter" issue. The more data you add, the more likely it becomes that a document which triggers the content filter ends up in the prompt. What kind of data are you training BERTopic on?

Very unlikely, but there can be false positives.

Why is this unlikely?

clstaudt commented 8 months ago

Only topic keyword lists are added to the prompt. However, you are right that it is not impossible for some keyword in some topic to trigger the content filter. (The documents include PubMed abstracts, so feel free to speculate.) With ChatGPT I also sometimes get false positives from the content filter.
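For concreteness, the prompt that reaches the API for each topic is presumably just the template with [KEYWORDS] replaced by that topic's keyword list, roughly along these lines (a sketch with made-up keywords):

```python
# Sketch: roughly what one rendered prompt looks like once [KEYWORDS] is
# substituted with a single topic's keywords (example keywords are made up).
prompt_template = (
    "Output a concise, English topic label for the following keywords. "
    "Output only the label. [KEYWORDS]"
)
keywords = ["tumor", "gene expression", "clinical trial", "biomarker"]
rendered = prompt_template.replace("[KEYWORDS]", ", ".join(keywords))
print(rendered)
```
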

Can this be caught so that it does not derail the training?

MaartenGr commented 8 months ago

Can this be caught so that it does not derail the training?

Certainly. However, it will most likely come with a bunch of warnings, since this is not expected behavior. The difficulty here is that I can use a try/except, but it is not exactly clear what is going wrong, so there will need to be some additional warnings to communicate that there is an unknown problem with OpenAI. Also, I will most likely catch only this very specific case, namely the KeyError: 'content'; opening up the try/except further is asking for issues.
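For illustration, such a narrow guard around the failing line in OpenAI.extract_topics might look roughly like this (a sketch of the idea, not the actual fix):

```python
import warnings

# Sketch: catch only the specific failure from the traceback and warn, so a
# single filtered or empty response does not derail the whole fit.
try:
    label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
except KeyError:
    warnings.warn("OpenAI returned a response without 'message.content' "
                  "(possibly blocked by a content filter); using an empty label.")
    label = ""
```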

I'll put it on the list and make sure it is fixed before the next release!