MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.14k stars 766 forks

OpenAI Representation Model error #1880

Closed dvdhdn closed 7 months ago

dvdhdn commented 7 months ago

Hi!

I am using BERTopic for a project and trying to use an OpenAI client as a representation model. For some reason this sometimes crashes with the following error:

"local variable 'truncated_document' referenced before assignment"

I think this has to do with the representative documents being passed into the representation model, but I can't find the exact source of the error. Does anyone have a clue what's going wrong exactly?

MaartenGr commented 7 months ago

Thanks for sharing. Could you share your full code along with BERTopic's installed version? Did you install BERTopic from the main branch?

MaartenGr commented 7 months ago

Also, please share the full error log.

dvdhdn commented 7 months ago

Hi! Apologies for being slightly unclear. Here is a snippet of my code (some parts are modified for privacy, and some are specific to the platform I'm using, Dataiku). I load embeddings from a dataset and then train my model.

# -*- coding: utf-8 -*-
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import transformers
from bertopic import BERTopic
import io
from sentence_transformers import SentenceTransformer
import torch
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import openai
from cct_code_library.oai_client import OpenAIClient
from cct_code_library.config import *
from azure.identity import DefaultAzureCredential
from bertopic.representation import MaximalMarginalRelevance
import os
import json
import sklearn
from bertopic.dimensionality import BaseDimensionalityReduction
import spacy
from spacy.lang.nl.examples import sentences
from sklearn.decomposition import PCA
# import seaborn as sns
import matplotlib.pyplot as plt
import time
import ast
import numpy as np
from datetime import datetime
import pickle
import safetensors
import bertopic._save_utils as save_utils
from openai_utils import *
from bertopic.representation import OpenAI
from openai import AzureOpenAI

############################################################################
#######Loading and cleaning the embeddings data#############################
############################################################################

# Read recipe inputs
embeddings = dataiku.Dataset("embeddings_train_set")
embeddings_df = embeddings.get_dataframe(sampling = 'head', limit = 1000)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
# # Setting up models

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.00,
                  random_state=42)

empty_dimensionality_model = BaseDimensionalityReduction()

hdbscan_model = HDBSCAN(min_cluster_size = 30)

general_stop_words = [
    "aan", "af", "al", "als", "bij", "dan", "dat", "die", "dit", "een", "en", "er", "had", "heb", "hem", "het", "hij",
    "hoe", "hun", "ik", "in", "is", "je", "kan", "me", "men", "met", "mij", "nog", "nu", "of", "ons", "ook", "te", "tot",
    "uit", "van", "was", "wat", "we", "wel", "wij", "zal", "ze", "zei", "zij", "zo", "zou", "aangaande", "aangezien",
    "achter", "achterna", "afgelopen", "aldaar", "aldus", "alhoewel", "alias", "alle", "allebei", "alleen", "alsnog",
    "altijd", "altoos", "ander", "andere", "anders", "anderszins", "behalve", "behoudens", "beide", "beiden", "ben",
    "beneden", "bent", "bepaald", "betreffende", "binnen", "binnenin", "boven", "bovenal", "bovendien", "bovengenoemd",
    "bovenstaand", "bovenvermeld", "buiten", "daar", "daarheen", "daarin", "daarna", "daarnet", "daarom", "daarop",
    "daarvanlangs", "de", "dikwijls", "door", "doorgaand", "dus", "echter", "eer", "eerdat", "eerder", "eerlang",
    "eerst", "elk", "elke", "enig", "enigszins", "enkel", "erdoor", "even", "eveneens", "evenwel", "gauw", "gedurende",
    "geen", "gehad", "gekund", "geleden", "gelijk", "gemoeten", "gemogen", "geweest", "gewoon", "gewoonweg", "haar",
    "hadden", "hare", "hebben", "hebt", "heeft", "hen", "hierbeneden", "hierboven", "hoewel", "hunne", "ikzelf",
    "inmiddels", "inzake", "jezelf", "jij", "jijzelf", "jou", "jouw", "jouwe", "juist", "jullie", "klaar", "kon",
    "konden", "krachtens", "kunnen", "kunt", "later", "liever", "maar", "mag", "meer", "mezelf", "mijn", "mijnent",
    "mijner", "mijzelf", "misschien", "mocht", "mochten", "moest", "moesten", "moet", "moeten", "mogen", "na", "naar",
    "nadat", "net", "niet", "noch", "nogal", "ofschoon", "om", "omdat", "omhoog", "omlaag", "omstreeks", "omtrent",
    "omver", "onder", "ondertussen", "ongeveer", "onszelf", "onze", "op", "opnieuw", "opzij", "over", "overeind",
    "overigens", "pas", "precies", "reeds", "rond", "rondom", "sedert", "sinds", "sindsdien", "slechts", "sommige",
    "spoedig", "steeds", "tamelijk", "tenzij", "terwijl", "thans", "tijdens", "toch", "toen", "toenmaals", "toenmalig",
    "totdat", "tussen", "uitgezonderd", "vaakwat", "vandaan", "vanuit", "vanwege", "veeleer", "verder", "vervolgens",
    "vol", "volgens", "voor", "vooraf", "vooral", "vooralsnog", "voorbij", "voordat", "voordezen", "voordien",
    "voorheen", "voorop", "vooruit", "vrij", "vroeg", "waar", "waarom", "wanneer", "want", "waren", "weer", "weg",
    "wegens", "weldra", "welk", "welke", "wie", "wiens", "wier", "wijzelf", "zelfs", "zichzelf", "zijn", "zijne",
    "zodra", "zonder", "zouden", "zowat", "zulke", "zullen", "zult"
]

custom_stop_words = [
    "postcode", "klant", "agent", "beller", "vraag", "vraagt"
]

stop_words = general_stop_words + custom_stop_words
vectorizer_model = CountVectorizer(stop_words = stop_words, ngram_range=(1, 2))

# Left out a chunk of setup here for privacy reasons; using the AzureOpenAI client - testing shows its responses work properly
oai_client = AzureOpenAI()

summarization_prompt = """
You work for the customer service department for a (redacted for online)
The following documents contain summaries of transcripts between a caller and an agent.
Your job is to extract the topic from a few documents. 

The topic contains the following documents: 
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'

Based on the information above, return a short DUTCH description summarizing
the documents fed to you. Give your output as 3-8 words in the following format:

topic: <description>
"""

representation_model = OpenAI(oai_client,
                              prompt=summarization_prompt,
                              model="model-gpt35-16k",
                              delay_in_seconds=2,
                              chat=True,
                              nr_docs=4,
                              doc_length=100)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       representation_model=representation_model,
                       verbose=True,
                       top_n_words=10,
                       low_memory=True,
                       calculate_probabilities=False)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
# # Training model

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Parse the stringified embeddings back into a numeric array
embeds_str = embeddings_df['embeddings'].to_numpy()
embeds = np.array([ast.literal_eval(s) for s in embeds_str])

topic_model.fit(embeddings_df['cct_base_callreason'].values, embeds)

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
topic_model.get_topics()

I get the following error:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-5-7f46b9fc4c16> in <cell line: 5>()
      3 embeds = np.array([ast.literal_eval(s) for s in embeds_str])
      4 
----> 5 topic_model.fit(callreason_embeddings_df['cct_base_callreason'].values, embeds)

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, images, y)
    314         ```
    315         """
--> 316         self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    317         return self
    318 

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
    431         else:
    432             # Extract topics by calculating c-TF-IDF
--> 433             self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    434 
    435             # Reduce topics

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in _extract_topics(self, documents, embeddings, mappings, verbose)
   3635         documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   3636         self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637         self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3638         self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3639         self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in _extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
   3920                 topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
   3921         elif isinstance(self.representation_model, BaseRepresentation):
-> 3922             topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
   3923         elif isinstance(self.representation_model, dict):
   3924             if self.representation_model.get("Main"):

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_openai.py in extract_topics(self, topic_model, documents, c_tf_idf, topics)
    198         updated_topics = {}
    199         for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
--> 200             truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    201             prompt = self._create_prompt(truncated_docs, topic, topics)
    202             self.prompts_.append(prompt)

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_openai.py in <listcomp>(.0)
    198         updated_topics = {}
    199         for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
--> 200             truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    201             prompt = self._create_prompt(truncated_docs, topic, topics)
    202             self.prompts_.append(prompt)

/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_utils.py in truncate_document(topic_model, doc_length, tokenizer, document)
     55             encoded_document = tokenizer.encode(document)
     56             truncated_document = tokenizer.decode(encoded_document[:doc_length])
---> 57         return truncated_document
     58     return document
     59 
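
Looking at that last frame, truncate_document in _utils.py branches on the tokenizer argument, and since I set doc_length=100 without setting a tokenizer, it looks like no branch ever assigns truncated_document before the return. A simplified reconstruction of the helper as I read it from the traceback (the exact branches may differ):

# Simplified sketch of bertopic/representation/_utils.py, reconstructed
# from the traceback above; exact branches may differ.
def truncate_document(topic_model, doc_length, tokenizer, document):
    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            truncated_document = " ".join(document.split()[:doc_length])
        elif hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
            encoded_document = tokenizer.encode(document)  # line 55 above
            truncated_document = tokenizer.decode(encoded_document[:doc_length])
        # With doc_length set but tokenizer=None, none of the branches run,
        # so this return raises UnboundLocalError:
        return truncated_document
    return document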
dvdhdn commented 7 months ago

Oh, and I am using BERTopic version 0.16.0! And also I forgot to thank you - thank you for this amazing tool and the help :)

MaartenGr commented 7 months ago

Ah, you need to also specify the tokenizer parameter in the representation model. So the following should work:

representation_model = OpenAI(oai_client,
                              prompt=summarization_prompt,
                              model="model-gpt35-16k",
                              delay_in_seconds=2,
                              chat=True,
                              nr_docs=4,
                              doc_length=100,
                              tokenizer="whitespace")
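
If you would rather have doc_length count tokens instead of whitespace-separated words, the traceback suggests any object exposing encode and decode methods should work as well; an untested sketch with tiktoken (the encoding model name here is just an example):

# Untested sketch: token-based truncation via tiktoken, which exposes the
# encode/decode methods that truncate_document checks for.
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")  # example model name
representation_model = OpenAI(oai_client,
                              prompt=summarization_prompt,
                              model="model-gpt35-16k",
                              delay_in_seconds=2,
                              chat=True,
                              nr_docs=4,
                              doc_length=100,
                              tokenizer=tokenizer)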

> And also I forgot to thank you - thank you for this amazing tool and the help :)

That's kind of you, thanks!

> custom_stop_words = [ "postcode", "klant", "agent", "beller", "vraag", "vraagt" ]

Nice to see a Dutch use case! Out of curiosity, which embedding model are you using?

dvdhdn commented 7 months ago

> Ah, you need to also specify the tokenizer parameter in the representation model. [...]

Thank you! This fixed everything :)

> Nice to see a Dutch use case! Out of curiosity, which embedding model are you using?

I am currently using "paraphrase-multilingual-mpnet-base-v2". After a bit of experimenting it seemed to work the most reasonably, but in the near future I may want to test a bit more extensively. Feel free to reach out if you have any more questions about this use case :)