Open clstaudt opened 8 months ago
I am not entirely sure, but it might be a result of calling the API too many times. Instead, it might be worthwhile to set exponential_backoff=True and a delay of at least 1 second. Overloading the server can cause unexpected issues.
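For context, exponential backoff simply retries a failed call with an exponentially growing wait between attempts. A minimal generic sketch of the idea (plain Python, not BERTopic's actual implementation; the helper name is hypothetical):

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponentially growing delays between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # wait base_delay, 2*base_delay, 4*base_delay, ... before retrying
            time.sleep(base_delay * (2 ** attempt))
```

With a one-second base delay, a transient rate-limit error gets retried at roughly 1 s, 2 s, 4 s, ... instead of hammering the server.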
@MaartenGr I found it difficult to reproduce; sometimes it just works with the same amount of data. Consider catching the exception and logging a more meaningful error.
Perhaps. This is the first time I have seen this issue, so I would first like to know the underlying cause before catching exceptions. If I can't reproduce it, I might add some sort of exception handling. However, since no actual error is raised, I can't really say much more than "Something went wrong, but I do not know what."
If you run it without OpenAI, how many topics are then created? It feels as if something is going wrong either with the number of calls being made, or the model does not know how to respond and therefore doesn't.
This issue seems to be related, but I am not sure how to explain empty topics.
This issue seems to refer to a "content filter", so perhaps the content of the prompt and your documents is against their regulations.
If you run it without OpenAI, how many topics are then created?
Around 50 topics. If I keep the number of topics the same but increase the input data size, the error is more likely to occur.
This issue seems to refer to a "content filter", so perhaps the content of the prompt and your documents is against their regulations.
Very unlikely, but there can be false positives.
Around 50 topics. If I keep the number of topics the same but increase the input data size, the error is more likely to occur.
That actually points toward the "content_filter" issue. The more data you add, the more likely it is that a specific document added to the prompt triggers the content filter. What kind of data are you training BERTopic on?
Very unlikely, but there can be false positives.
Why is this unlikely?
Only topic keyword lists are added to the prompt. However, you are right that it is not impossible some keyword in some topic triggers the content filter. (The documents are PubMed abstracts, so feel free to speculate.) With ChatGPT I also sometimes get false positives from the content filter.
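One way to see how a content-filter hit can surface as KeyError: 'content': in the dict-style chat-completion response shape, a filtered completion reports finish_reason == "content_filter" and may come back without a "content" key in the message. A small sketch of defensively extracting the label (the function name and "[filtered]" placeholder are assumptions, not BERTopic's code):

```python
def extract_topic_label(response: dict) -> str:
    """Pull the generated label out of a chat-completion-style response,
    falling back to a placeholder when the content filter fired."""
    choice = response["choices"][0]
    if choice.get("finish_reason") == "content_filter":
        # the 'content' key may be missing entirely in this case,
        # which is exactly what would raise KeyError: 'content'
        return "[filtered]"
    return choice["message"]["content"]
```

Checking finish_reason before touching message["content"] avoids the blind dictionary lookup that the traceback points to.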
Can this be caught so that it does not derail the training?
Can this be caught so that it does not derail the training?
Certainly. However, it will most likely come with a bunch of warnings, since this is not expected behavior. The difficulty here is that I can use a try/except, but it is not exactly clear what is going wrong, so there will need to be additional warnings to communicate that there is an unknown problem with OpenAI. Also, I will most likely catch only this very specific instance, namely the KeyError: 'content'. Opening up the try/except is asking for issues.
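The narrow try/except described here might look something like the following sketch (hypothetical helper and fallback names, not BERTopic's actual code); note that it re-raises any KeyError other than 'content' so unrelated bugs are not swallowed:

```python
import warnings

def safe_label(generate, fallback="Unlabelled topic"):
    """Call generate() and catch only the specific KeyError: 'content',
    warning instead of derailing the whole training run."""
    try:
        return generate()
    except KeyError as err:
        if err.args and err.args[0] == "content":
            warnings.warn(
                "OpenAI returned a response without 'content'; "
                "an unknown problem occurred (possibly the content filter). "
                "Falling back to a default label."
            )
            return fallback
        raise  # any other KeyError is a real bug; re-raise it
```

Keeping the except clause this specific is the point: a bare try/except around the whole call would hide genuinely unexpected failures.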
I'll put it on the list and make sure it is fixed before the next release!
I am getting a KeyError when running the topic model on a large dataset (400k documents), but not on smaller samples.
This is the representation model I am using: