MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars 765 forks

OpenAI representation fails to produce output when response content is None #2176

Open jeaninejuliettes opened 1 month ago

jeaninejuliettes commented 1 month ago

Have you searched existing issues? 🔎

Describe the bug

I ran into issues when using the OpenAI representation as it sometimes produces a content of None, which then produced an error when trying to run: label = response.choices[0].message.content.strip().replace("topic: ", "")

Which makes sense, since the content is not a string. I'm unable to generate a minimal example since this is due to the output of OpenAI GPT.

I see two ways to work around this, but both have their own downsides and impact on the results; maybe someone else sees a better option:

  1. Cast the content to a string before processing it any further. The major downside is that the label will then be set to the string 'None'.
  2. Use a try/except to extract the content, strip it, and replace the 'topic:' part of the string. If this fails, the label is set to a fixed value such as an empty string (and a warning is raised that this has happened).

For now I fixed it by creating an inherited customOpenAI representation class within my script where I used the second option as a solution.
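A minimal sketch of the second option, written as a standalone helper rather than as the actual subclass. The names `extract_label` and `fake_response` are hypothetical, and the `SimpleNamespace` objects only imitate the shape of the response (`choices[0].message.content`) so the snippet runs without calling the API:

```python
import warnings
from types import SimpleNamespace


def extract_label(response, fallback=""):
    """Return the cleaned label, or `fallback` when no usable content came back."""
    try:
        content = response.choices[0].message.content
        # Raises AttributeError when content is None.
        return content.strip().replace("topic: ", "")
    except (AttributeError, IndexError):
        warnings.warn("No usable content in the response; using the fallback label.")
        return fallback


def fake_response(content):
    """Hypothetical stand-in that mimics the shape of a chat completion response."""
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


label_ok = extract_label(fake_response("topic: customer support"))
label_missing = extract_label(fake_response(None))
```

With this approach a topic whose content is None ends up with an empty label and a warning instead of stopping the whole run.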

Reproduction

from bertopic import BERTopic

BERTopic Version

0.16.4

MaartenGr commented 1 month ago

Thank you for sharing this. I see that you opened a similar issue (https://github.com/MaartenGr/BERTopic/discussions/2177). Are you alright with closing that one? To me, they seem like duplicates.

With respect to your issue, the idea of content violation was mentioned in earlier issues and addressed with the following:

https://github.com/MaartenGr/BERTopic/blob/9518035d41087a801ae39000e6ea1f3641983396/bertopic/representation/_openai.py#L232-L237

Which makes it rather surprising that you get this issue. It may be that the API of OpenAI was updated and now always returns "content" but I'm not sure. Either way, simply doing an additional check here makes sense to me.

jeaninejuliettes commented 1 month ago

No, I'm sorry this was unclear. For this specific issue I don't get any errors regarding content violation. It simply seems that the result of response.choices[0].message returns None, which then produces an error, since you can't use strip on a NoneType object. I don't know when or why this happens, but it doesn't seem to be the result of an error produced by the API, since the response object exists.

That is also why I created a separate "issue" (discussion/question) for the content violation: I gathered from the code that this was supposed to have been fixed, but unfortunately I'm still running into it. But that is a discussion for #2177 as far as I'm concerned. As far as I can tell, the two don't seem to be related.

MaartenGr commented 1 month ago

I think that this:

I ran into issues when using the OpenAI representation as it sometimes produces a content of None, which then produced an error when trying to run: label = response.choices[0].message.content.strip().replace("topic: ", "")

and this:

response.choices[0].message returns None

contradict one another. The reason I think so is that you shouldn't be able to reach label = ... at all, because there is this check (which is used for content violation):

https://github.com/MaartenGr/BERTopic/blob/9518035d41087a801ae39000e6ea1f3641983396/bertopic/representation/_openai.py#L232-L237

Thus, response.choices[0].message returns None cannot be the case, because there is a check to see whether it contains the attribute "content", right? Or did you mean that "content" returns None? If so, then the API of OpenAI might have changed, since it didn't show that behavior before.
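The difference between the two cases can be illustrated with a small sketch. It assumes the check is an attribute test along the lines of `hasattr(message, "content")`, as described above. The `SimpleNamespace` object is a stand-in for the real message, and `label_from` with its fallback text is a hypothetical helper, not the library's code:

```python
from types import SimpleNamespace

# Stand-in for a message whose "content" attribute exists but holds None.
message = SimpleNamespace(content=None)

# An attribute check passes, because the attribute is present...
has_attr = hasattr(message, "content")

# ...but the value is still not a string, so .strip() would fail on it.
is_usable = isinstance(message.content, str)


def label_from(message, fallback="No label returned"):
    """Hypothetical helper: only clean the content when it is really a string."""
    content = getattr(message, "content", None)
    if isinstance(content, str):
        return content.strip().replace("topic: ", "")
    return fallback
```

Checking the type of the value, rather than only the presence of the attribute, covers both a missing "content" and a "content" that is None.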

Looking through the issues, it seems that this was mentioned before, and there is a PR for it that hasn't been updated in a couple of months. API changes might play a role here, but so might the reason why you get a None, which is typically a content violation issue. Based on what I see, I'm convinced the two are related.

jeaninejuliettes commented 1 month ago

Yes, I mean that the content returns None. The response exists and the content element does exist in the response object, but the content it's returning is empty. Ah, I didn't see that issue (apologies), but it is the exact error message I'm seeing. And reading through the issue, it looks quite similar. But the PR is inactive?

Funny thing is, I'm still also getting content violation errors, but let's keep that out of this discussion for now ;)

MaartenGr commented 1 month ago

It does seem to be inactive and unfortunately, I currently do not have the time to look it over. I would also be alright with a small PR just making sure it gives no error. Any additional work can be done later.

jeaninejuliettes commented 1 month ago

Ok, I can look into that!