MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

OpenAI representation not working: it never completes, as shown in the code and output below. #1667

Open Manas-Shrivastav opened 11 months ago

Manas-Shrivastav commented 11 months ago

```python
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech

# KeyBERT
keybert_model = KeyBERTInspired()

# GPT-3.5
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic:
"""
client = openai.OpenAI(api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
openai_model = OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "OpenAI": openai_model,  # Uncomment if you will use OpenAI
}

from bertopic import BERTopic

topic_model = BERTopic(
    # Pipeline models (embedding_model and umap_model are defined elsewhere)
    embedding_model=embedding_model,
    umap_model=umap_model,
    representation_model=representation_model,

    # Hyperparameters
    nr_topics="auto",
    min_topic_size=30,
    verbose=True,
)

topics, probs = topic_model.fit_transform(subgroup_dfs['Pet Foods']['productMaterial'])
```

```
2023-12-06 05:33:42,424 - BERTopic - Embedding - Transforming documents to embeddings.
.gitattributes: 100% 1.18k/1.18k [00:00<00:00, 58.7kB/s]
1_Pooling/config.json: 100% 190/190 [00:00<00:00, 12.6kB/s]
README.md: 100% 10.6k/10.6k [00:00<00:00, 651kB/s]
config.json: 100% 612/612 [00:00<00:00, 40.0kB/s]
config_sentence_transformers.json: 100% 116/116 [00:00<00:00, 8.19kB/s]
data_config.json: 100% 39.3k/39.3k [00:00<00:00, 605kB/s]
pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 171MB/s]
sentence_bert_config.json: 100% 53.0/53.0 [00:00<00:00, 2.44kB/s]
special_tokens_map.json: 100% 112/112 [00:00<00:00, 7.00kB/s]
tokenizer.json: 100% 466k/466k [00:00<00:00, 2.39MB/s]
tokenizer_config.json: 100% 350/350 [00:00<00:00, 27.1kB/s]
train_script.py: 100% 13.2k/13.2k [00:00<00:00, 828kB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 14.7MB/s]
modules.json: 100% 349/349 [00:00<00:00, 28.5kB/s]
Batches: 100% 90/90 [00:04<00:00, 61.69it/s]
2023-12-06 05:33:56,511 - BERTopic - Embedding - Completed ✓
2023-12-06 05:33:56,512 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-06 05:34:35,919 - BERTopic - Dimensionality - Completed ✓
2023-12-06 05:34:35,921 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-06 05:34:36,010 - BERTopic - Cluster - Completed ✓
2023-12-06 05:34:36,011 - BERTopic - Representation - Extracting topics from clusters using representation models.
  0%|          | 0/31 [00:00<?, ?it/s]
```
MaartenGr commented 11 months ago

It might not be able to connect to the OpenAI servers. Could you try sending any prompt with `openai` directly, outside of BERTopic, to see whether it can make a connection at all? Also, a good tip is to follow along with the best practices. For instance, I would advise against `nr_topics="auto"` and suggest tuning `min_topic_size` instead, since fewer topics means fewer calls to the API.
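
As a sketch of that first suggestion — calling `openai` directly to check connectivity, independent of BERTopic — something like the following could narrow things down. The helper name, the test prompt, and the `timeout`/`max_retries` values are illustrative choices, not from the thread; they are picked so a connection problem surfaces quickly instead of hanging indefinitely.

```python
def check_openai_connection(api_key: str) -> str:
    """Send one trivial chat completion; raises if the API is unreachable.

    Minimal sketch: a short timeout and a single retry (illustrative values)
    make a connectivity problem fail fast rather than appear to hang.
    """
    import openai  # assumes the v1 openai package, as used in the issue

    client = openai.OpenAI(api_key=api_key, timeout=30.0, max_retries=1)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Reply with the word OK."}],
    )
    return resp.choices[0].message.content

# Usage with a real key: print(check_openai_connection("sk-..."))
```

If this raises (for example an `APIConnectionError` or a timeout), the hang in BERTopic is almost certainly a network or API-access issue rather than a BERTopic bug.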