openai incompatible issues with Bertopic #1629

Open jamesleverage opened 9 months ago

jamesleverage commented 9 months ago

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic:
"""

openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "OpenAI": openai_model,  # Uncomment if you will use OpenAI
}

topics, probs = model.fit_transform(smaller_docs_list)

I'm getting the following error message:

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the

When I downgraded openai to 0.38, this error went away. However, the execution timed out after 600 seconds.

MaartenGr commented 9 months ago

Thanks for sharing! It seems that with openai 1.0.0 there were breaking changes to the API which need to be updated in BERTopic. I'll make sure to fix it in this PR since there were more OpenAI updates there.

MaartenGr commented 9 months ago

@jamesleverage If I'm not mistaken, you can use openai 0.28 instead of 0.38 and I believe it should be working. However, I just pushed a fix to the PR mentioned above that should make it work with openai >= 1.0. In the upcoming release of BERTopic, openai < 1.0 will not be supported anymore.
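
Because the breaking change sits at openai 1.0.0, it can help to confirm which side of that boundary an environment is on before debugging further. The following is a minimal sketch using only the standard library; the helper names `uses_v1_client` and `installed_openai_version` are hypothetical and not part of openai or BERTopic.

```python
from importlib import metadata
from typing import Optional


def uses_v1_client(version: str) -> bool:
    """Return True if the version string belongs to openai 1.0.0 or newer."""
    major = int(version.split(".")[0])
    return major >= 1


def installed_openai_version() -> Optional[str]:
    """Return the installed openai version, or None if the package is missing."""
    try:
        return metadata.version("openai")
    except metadata.PackageNotFoundError:
        return None


version = installed_openai_version()
if version is None:
    print("openai is not installed")
elif uses_v1_client(version):
    print(f"openai {version}: client-based API, openai.ChatCompletion is gone")
else:
    print(f"openai {version}: legacy module-level API")
```

Running this in the same kernel as BERTopic shows whether a downgrade or upgrade actually took effect, which is easy to miss in notebooks that were not restarted.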

jamesleverage commented 9 months ago

I used openai==0.28 and am getting this error at this line after running for 10+ minutes:

representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)

model = BERTopic(representation_model=representation_model)
topics, probs = model.fit_transform(smaller_docs_list)  # <==== This is where the execution hangs.

Error message:

TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

ReadTimeoutError                          Traceback (most recent call last)
ReadTimeoutError: HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=600)

During handling of the above exception, another exception occurred:

The old release is causing a problem.

Is there something I can add to my code to make this work? Or should I just wait for the latest PR fix?

For simple prompting to OpenAI, this definition worked with the latest openai:

from openai import OpenAI

def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    client = OpenAI(api_key=OPENAI_API_KEY)
    # openai>=1.0 goes through a client object instead of openai.ChatCompletion
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature
    )
    return response.choices[0].message.content

MaartenGr commented 9 months ago

Is there something I can add to my code to make this work? Or should I just wait for the latest PR fix?

If 0.28 is currently not working, then I would wait until the PR fix. You can already download it if you want like this:

pip install git+

linxule commented 9 months ago

Adding to the discussion, while we wait for the PR fix, I'm trying to do the labeling after training BERTopic models. Would it be possible to adjust the number of representative documents when we use get_representative_docs?

MaartenGr commented 9 months ago

@linxule .get_representative_docs is a function that does no calculations with respect to extracting the most representative documents. For that, you would have to use the internal ._extract_representative_docs, which is used to calculate which documents are most representative of a given topic. Do note, though, that since this is a private function, breaking changes might appear in future releases and no additional official support can be given.

I believe you can use it as follows:

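import pandas as pd

documents = pd.DataFrame({
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": None,
    "Image": None
})

# Call arguments are an assumption based on the rest of this thread
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_labels_,
    nr_repr_docs=10
)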

Where docs are your input documents. I have not tested this so there might be a few mistakes there but the general principle should be solid.

linxule commented 9 months ago

Hi @MaartenGr ,

I tried the solution you suggested but encountered some issues

I ran

import pandas as pd

# Assuming df_dict and models are defined in an accessible scope
# df_dict: Dictionary of DataFrames
# models: Dictionary of BERTopic models

def extract_representative_documents(df_name, nr_repr_docs=5):
    """
    Extracts representative documents for each topic from a DataFrame specified by df_name.

    Parameters:
    - df_name: The name of the DataFrame within df_dict.
    - nr_repr_docs: Number of representative documents to extract for each topic (default is 5).

    Returns:
    - A DataFrame with the representative documents and their associated topics.
    """
    if df_name not in df_dict:
        raise ValueError(f"DataFrame with name '{df_name}' not found in df_dict")
    if df_name not in models:
        raise ValueError(f"BERTopic model with name '{df_name}' not found in models")

    # Access the documents from the specified DataFrame
    docs = df_dict[df_name]['Post_Content']

    # Create a DataFrame for the documents
    documents = pd.DataFrame({
        "Document": docs,
        "ID": range(len(docs)),
        "Topic": None,
        "Image": None
    })

    # Extract representative documents using the BERTopic model
    # (call arguments are an assumption based on the rest of this thread)
    repr_docs, _, _, _ = models[df_name]._extract_representative_docs(
        c_tf_idf=models[df_name].c_tf_idf_,
        documents=documents,
        topics=models[df_name].topic_labels_,
        nr_repr_docs=nr_repr_docs
    )

    return repr_docs

# Example usage
representative_docs = extract_representative_documents(df_name)

I got

Besides the proposed solution, is there any way to use topic_model.get_representative_docs() and specify the number of representative documents? This approach seems to default to 3 representative documents.

MaartenGr commented 9 months ago


Besides the proposed solution, is there any way to use topic_model.get_representative_docs() and specify the number of representative documents? This approach seems to default to 3 representative documents.

As mentioned above, topic_model.get_representative_docs() does not actually calculate which documents are most representative, as that is done during training. Instead, topic_model.get_representative_docs() simply returns the previously calculated representative documents in a nice format. As a result, it is simply not possible to get more than 3 representative documents that way, since the training documents are not saved within the topic model. The reason for this is that saving training data within a model is something that we should generally avoid, especially if the data is large.

Instead, you can fix the error you ran into with the following code. I just tested it and it should work to extract, for example, the top 10 representative documents per topic:

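import pandas as pd

documents = pd.DataFrame({
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": topic_model.topics_,
    "Image": None
})

# Arguments other than `topics` are an assumption based on this thread
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_labels_,
    nr_repr_docs=10
)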

Note that what is happening in the above code is that the documents are passed to the function which is not the case with topic_model.get_representative_docs.
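
To make the idea of "most representative" more concrete, here is a toy sketch that scores each document against a topic-level vector and keeps the top `n`. It is a simplified stand-in and not BERTopic's implementation: it uses raw word counts where BERTopic works with c-TF-IDF, and `top_representative_docs` is a hypothetical helper. What it does share with the approach described above is that the documents themselves have to be passed in.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_representative_docs(
    docs: List[str], topics: List[int], nr_repr_docs: int = 3
) -> Dict[int, List[str]]:
    """Return, per topic, the documents most similar to the topic as a whole."""
    # Group documents by their assigned topic
    by_topic: Dict[int, List[str]] = {}
    for doc, topic in zip(docs, topics):
        by_topic.setdefault(topic, []).append(doc)

    result: Dict[int, List[str]] = {}
    for topic, members in by_topic.items():
        # Topic vector: word counts over all documents in the topic
        topic_vector = Counter(w for d in members for w in d.lower().split())
        ranked = sorted(
            members,
            key=lambda d: cosine(Counter(d.lower().split()), topic_vector),
            reverse=True,
        )
        result[topic] = ranked[:nr_repr_docs]
    return result


docs = ["cats purr and cats nap", "dogs bark", "cats nap", "dogs bark and dogs fetch"]
topics = [0, 1, 0, 1]
print(top_representative_docs(docs, topics, nr_repr_docs=1))
```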

linxule commented 9 months ago

@MaartenGr Thank you for your quick response.

I ran

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]

# Create a DataFrame for the documents
documents = pd.DataFrame({
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": topic_model.topics_,
    "Image": None
})

# Extract representative documents using the BERTopic model
# (arguments other than `topics` are an assumption based on this thread)
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topics_,
    nr_repr_docs=10
)

And got

So I adjusted it to

# Access the model and documents
docs = df_dict[df_name]['Post_Content']
topic_model = models[df_name]

# Retrieve the topics as a dictionary (replace get_topics() with the correct method)
topics_dict = topic_model.get_topics()  # This should be a dictionary

# Create a DataFrame for the documents
documents = pd.DataFrame({
    "Document": docs,
    "ID": range(len(docs)),
    "Topic": topic_model.topics_,
    "Image": None
})

# Extract representative documents using the BERTopic model
# (arguments other than topics_dict are an assumption based on this thread)
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_,
    documents,
    topics_dict,  # Use the topics dictionary
)

This worked. Do you have any comments on this approach? Am I missing anything?

Thank you again for your help!

MaartenGr commented 9 months ago

The reason for your error is that you did not copy my example as shown. In ._extract_representative_docs you should use topic_model.topic_labels_ instead of topic_model.topics_.

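Don't do this:

# Extract representative documents using the BERTopic model
# (arguments other than `topics` are an assumption based on this thread)
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_, documents, topics=topic_model.topics_)

Do this:

repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_, documents, topics=topic_model.topic_labels_)

linxule commented 9 months ago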

Thank you so much for spotting the error! It works now!

jamesleverage commented 9 months ago

Is there something I can add to my code to make this work? Or should I just wait for the latest PR fix?

If 0.28 is currently not working, then I would wait until the PR fix. You can already download it if you want like this:

pip install git+

There is a warning message.

!pip install git+

Collecting git+
Cloning (to revision refs/pull/1572/head) to /tmp/pip-req-build-9vcn0uye
Running command git clone --filter=blob:none --quiet /tmp/pip-req-build-9vcn0uye
WARNING: Did not find branch or tag 'refs/pull/1572/head', assuming revision or ref.

linxule commented 9 months ago

@jamesleverage You can try this

!pip install git+

This will install the Commits on Nov 17, 2023 (

giannisni commented 9 months ago

Hello! I'm working with this Colab notebook, BERTopic - Best Practices.ipynb.

I am getting this error even when working with openai==0.28 (it was working before). Sorry, I saw the comments but I am confused.

<ipython-input-23-6a1f85215426> in <cell line: 27>()
     26 # Initialize the OpenAI model for BERTopic
---> 27 openai_model = OpenAI(model="gpt-3.5-turbo", prompt=prompt, chat=True, exponential_backoff=True)
     29 # All representation models

TypeError: OpenAI.__init__() missing 1 required positional argument: 'client'

MaartenGr commented 9 months ago

@giannisni Thanks for sharing! I just updated the notebook, can you check whether it works?
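
For readers hitting the same `TypeError`: with the updated BERTopic, the OpenAI representation model expects an openai client as its first argument, as the later `OpenAI(client, ...)` call in this thread also shows. Below is a minimal sketch under that assumption. The wrapper `build_representation_model` is a hypothetical helper, and the keyword arguments are the ones already used earlier in this issue.

```python
from typing import Optional


def build_representation_model(
    api_key: str, model: str = "gpt-3.5-turbo", prompt: Optional[str] = None
):
    """Create a BERTopic OpenAI representation model backed by an openai>=1.0 client."""
    # Imported lazily so that defining this helper does not require the packages
    import openai
    from bertopic.representation import OpenAI

    # openai>=1.0 replaces module-level calls with an explicit client object
    client = openai.OpenAI(api_key=api_key)

    kwargs = {"model": model, "chat": True, "exponential_backoff": True}
    if prompt is not None:
        kwargs["prompt"] = prompt

    # The client goes first; leaving it out raises the TypeError shown above
    return OpenAI(client, **kwargs)
```

Usage would then look like `representation_model = build_representation_model(OPENAI_API_KEY, prompt=prompt)` followed by `BERTopic(representation_model=representation_model)`, where `OPENAI_API_KEY` is your own key.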

giannisni commented 9 months ago

Hi @MaartenGr, it works now, thanks. Though in my copy of the notebook, using the same documents of mine as before, I am now getting this error, which was not happening previously. Also, can you please explain what exactly is passed to [DOCUMENTS] in the prompt?

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic:
"""

The error:

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 40728 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
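
On the `[DOCUMENTS]` question, the general mechanism is placeholder substitution: a few representative documents of the topic replace `[DOCUMENTS]` and the topic's keywords replace `[KEYWORDS]`, so very long documents inflate the request quickly. The sketch below illustrates that substitution only. `fill_prompt` is a hypothetical helper, the prompt is shortened, and the exact formatting BERTopic applies may differ.

```python
from typing import List

PROMPT = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
"""


def fill_prompt(prompt: str, docs: List[str], keywords: List[str]) -> str:
    """Substitute representative documents and keywords into the prompt tags."""
    # One bullet per representative document
    doc_block = "\n".join(f"- {doc}" for doc in docs)
    filled = prompt.replace("[KEYWORDS]", ", ".join(keywords))
    return filled.replace("[DOCUMENTS]", doc_block)


filled = fill_prompt(
    PROMPT,
    docs=["First representative document.", "Second representative document."],
    keywords=["topic", "modeling", "labels"],
)
print(filled)
# A rough size check; whitespace-separated words only approximate model tokens
print(len(filled.split()), "whitespace-separated words")
```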

MaartenGr commented 9 months ago

@giannisni Most likely, your documents are simply too big. I would advise applying document truncation using this guide.
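
The linked guide covers BERTopic's own truncation options. The underlying idea is simply to cap each document before it is placed in the prompt. Below is a minimal sketch with whitespace tokenization; `truncate_document` is a hypothetical helper, and a whitespace token is only a rough proxy for the tokens the model counts.

```python
def truncate_document(doc: str, max_tokens: int) -> str:
    """Keep at most `max_tokens` whitespace-separated tokens of a document."""
    tokens = doc.split()
    if len(tokens) <= max_tokens:
        return doc  # short documents are returned unchanged
    return " ".join(tokens[:max_tokens])


long_doc = "word " * 1000
short_doc = truncate_document(long_doc, max_tokens=100)
print(len(short_doc.split()))
```

With only a handful of documents per topic in the prompt, a cap of a few hundred tokens each keeps the request well below the 16385-token limit reported in the error.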

SebastianSpeer commented 5 months ago


I'm experiencing a similar incompatibility issue. I am running bertopic version 0.16.0 and it was running fine until I updated openai. Now, I'm getting import errors no matter which version of openai I tried. I've tried 0.28, 1.10 and the latest version.

Whenever I import bertopic I'm getting the following error:

I've also created a new conda environment and retried it and it still does not work. It only runs when using version 14 or smaller. Do you know what might be going on?

MaartenGr commented 5 months ago


Based on this line in your error message, you are not using v0.16.0 but v0.15.0:

3 __version__ = "0.15.0"

Please make sure that you are using the newest version of BERTopic.

YooWonTaek commented 4 months ago

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

representation_model_openai = OpenAI(client, model="gpt-4-turbo-preview", chat=True)
topic_model.update_topics(texts, topics, representation_model=representation_model_openai)

and the topics I got changed slightly every time I ran the code.

MaartenGr commented 4 months ago

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

You can use the generator_kwargs for that (see the docstrings).

and the topics I got changed slightly every time I ran the code.

I would advise checking out the FAQ. It might be that you need to install UMAP from the main branch (I believe a PR updated some things) but I'm not sure, you will have to test.

YooWonTaek commented 4 months ago

A related question: can I somehow set the temperature argument when using OpenAI() to refine topic representations? Now I am using:

You can use the generator_kwargs for that (see the docstrings).

and the topics I got changed slightly every time I ran the code.

I would advise checking out the FAQ. It might be that you need to install UMAP from the main branch (I believe a PR updated some things) but I'm not sure, you will have to test.

Thanks! I will try out the generator_kwargs. I have already set a seed for UMAP, and my topics are the same every time; the only difference is the topic representations I get.
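
A minimal sketch of the `generator_kwargs` suggestion, assuming the dictionary is forwarded to the chat completion request so that `temperature` can be set there. The wrapper `build_low_variance_representation` is a hypothetical helper, and `client` is an already constructed openai client as in the snippet above.

```python
def build_low_variance_representation(
    client, model: str = "gpt-4-turbo-preview", temperature: float = 0.0
):
    """Create an OpenAI representation model with a fixed sampling temperature."""
    # Imported lazily so that defining this helper does not require BERTopic
    from bertopic.representation import OpenAI

    return OpenAI(
        client,
        model=model,
        chat=True,
        # Assumption: these kwargs reach the completion call unchanged
        generator_kwargs={"temperature": temperature},
    )
```

It would be used as `topic_model.update_topics(texts, topics, representation_model=build_low_variance_representation(client))`. A temperature of 0 should reduce run-to-run differences in the generated labels, though it may not remove them entirely.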