Hi!
I am using BERTopic for a project and trying to use an OpenAI client as a representation model. For some reason this sometimes crashes with the following error:
"local variable 'truncated_document' referenced before assignment"
I think this has to do with the representative documents being passed into the representation model, but I can't find the exact source of this error. Does anyone have a clue what's going wrong exactly?
Thanks for sharing. Could you share your full code along with BERTopic's installed version? Did you install BERTopic from the main branch?
Also, please share the full error log.
Hi! Apologies for being slightly unclear. Here is a snippet of my code (some parts slightly modified for privacy reasons). Some parts are specific to the platform I'm using (Dataiku). I load embeddings from a dataset and then train my model.
# -*- coding: utf-8 -*-
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import transformers
from bertopic import BERTopic
import io
from sentence_transformers import SentenceTransformer
import torch
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import openai
from cct_code_library.oai_client import OpenAIClient
from cct_code_library.config import *
from azure.identity import DefaultAzureCredential
from bertopic.representation import MaximalMarginalRelevance
import os
import json
import sklearn
from bertopic.dimensionality import BaseDimensionalityReduction
import spacy
from spacy.lang.nl.examples import sentences
from sklearn.decomposition import PCA
# import seaborn as sns
import matplotlib.pyplot as plt
import time
import ast
import numpy as np
from datetime import datetime
import pickle
import safetensors
import bertopic._save_utils as save_utils
from openai_utils import *
from bertopic.representation import OpenAI
from openai import AzureOpenAI
############################################################################
#######Loading and cleaning the embeddings data#############################
############################################################################
# Read recipe inputs
embeddings = dataiku.Dataset("embeddings_train_set")
embeddings_df = embeddings.get_dataframe(sampling='head', limit=1000)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
# # Setting up models
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  random_state=42)
empty_dimensionality_model = BaseDimensionalityReduction()
hdbscan_model = HDBSCAN(min_cluster_size=30)
general_stop_words = [
"aan", "af", "al", "als", "bij", "dan", "dat", "die", "dit", "een", "en", "er", "had", "heb", "hem", "het", "hij",
"hoe", "hun", "ik", "in", "is", "je", "kan", "me", "men", "met", "mij", "nog", "nu", "of", "ons", "ook", "te", "tot",
"uit", "van", "was", "wat", "we", "wel", "wij", "zal", "ze", "zei", "zij", "zo", "zou", "aangaande", "aangezien",
"achter", "achterna", "afgelopen", "aldaar", "aldus", "alhoewel", "alias", "alle", "allebei", "alleen", "alsnog",
"altijd", "altoos", "ander", "andere", "anders", "anderszins", "behalve", "behoudens", "beide", "beiden", "ben",
"beneden", "bent", "bepaald", "betreffende", "binnen", "binnenin", "boven", "bovenal", "bovendien", "bovengenoemd",
"bovenstaand", "bovenvermeld", "buiten", "daar", "daarheen", "daarin", "daarna", "daarnet", "daarom", "daarop",
"daarvanlangs", "de", "dikwijls", "door", "doorgaand", "dus", "echter", "eer", "eerdat", "eerder", "eerlang",
"eerst", "elk", "elke", "enig", "enigszins", "enkel", "erdoor", "even", "eveneens", "evenwel", "gauw", "gedurende",
"geen", "gehad", "gekund", "geleden", "gelijk", "gemoeten", "gemogen", "geweest", "gewoon", "gewoonweg", "haar",
"hadden", "hare", "hebben", "hebt", "heeft", "hen", "hierbeneden", "hierboven", "hoewel", "hunne", "ikzelf",
"inmiddels", "inzake", "jezelf", "jij", "jijzelf", "jou", "jouw", "jouwe", "juist", "jullie", "klaar", "kon",
"konden", "krachtens", "kunnen", "kunt", "later", "liever", "maar", "mag", "meer", "mezelf", "mijn", "mijnent",
"mijner", "mijzelf", "misschien", "mocht", "mochten", "moest", "moesten", "moet", "moeten", "mogen", "na", "naar",
"nadat", "net", "niet", "noch", "nogal", "ofschoon", "om", "omdat", "omhoog", "omlaag", "omstreeks", "omtrent",
"omver", "onder", "ondertussen", "ongeveer", "onszelf", "onze", "op", "opnieuw", "opzij", "over", "overeind",
"overigens", "pas", "precies", "reeds", "rond", "rondom", "sedert", "sinds", "sindsdien", "slechts", "sommige",
"spoedig", "steeds", "tamelijk", "tenzij", "terwijl", "thans", "tijdens", "toch", "toen", "toenmaals", "toenmalig",
"totdat", "tussen", "uitgezonderd", "vaakwat", "vandaan", "vanuit", "vanwege", "veeleer", "verder", "vervolgens",
"vol", "volgens", "voor", "vooraf", "vooral", "vooralsnog", "voorbij", "voordat", "voordezen", "voordien",
"voorheen", "voorop", "vooruit", "vrij", "vroeg", "waar", "waarom", "wanneer", "want", "waren", "weer", "weg",
"wegens", "weldra", "welk", "welke", "wie", "wiens", "wier", "wijzelf", "zelfs", "zichzelf", "zijn", "zijne",
"zodra", "zonder", "zouden", "zowat", "zulke", "zullen", "zult"
]
custom_stop_words = [
"postcode", "klant", "agent", "beller", "vraag", "vraagt"
]
stop_words = general_stop_words + custom_stop_words
vectorizer_model = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
# Left out some configuration here for privacy reasons; using the AzureOpenAI client - testing shows it returns responses properly
oai_client = AzureOpenAI()
summarization_prompt = """
You work for the customer service department for a (redacted for online)
The following documents contain summaries of transcripts between a caller and an agent.
Your job is to extract the topic from a few documents.
The topic contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'
Based on the information above, return a short DUTCH description summarizing
the documents fed to you. Give your output as 3-8 words in the following format:
topic: <description>
"""
representation_model = OpenAI(oai_client,
                              prompt=summarization_prompt,
                              model="model-gpt35-16k",
                              delay_in_seconds=2,
                              chat=True,
                              nr_docs=4,
                              doc_length=100)
topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       representation_model=representation_model,
                       verbose=True,
                       top_n_words=10,
                       low_memory=True,
                       calculate_probabilities=False)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
# # Training model
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Getting the embeddings: they are stored as strings, so parse them back into a numeric array
embeds_str = embeddings_df['embeddings'].to_numpy()
embeds = np.array([ast.literal_eval(s) for s in embeds_str])
topic_model.fit(embeddings_df['cct_base_callreason'].values, embeds)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
topic_model.get_topics()
I get the following error:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
<ipython-input-5-7f46b9fc4c16> in <cell line: 5>()
3 embeds = np.array([ast.literal_eval(s) for s in embeds_str])
4
----> 5 topic_model.fit(callreason_embeddings_df['cct_base_callreason'].values, embeds)
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit(self, documents, embeddings, images, y)
314 ```
315 """
--> 316 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
317 return self
318
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, images, y)
431 else:
432 # Extract topics by calculating c-TF-IDF
--> 433 self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
434
435 # Reduce topics
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in _extract_topics(self, documents, embeddings, mappings, verbose)
3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/_bertopic.py in _extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
3920 topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922 topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
3923 elif isinstance(self.representation_model, dict):
3924 if self.representation_model.get("Main"):
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_openai.py in extract_topics(self, topic_model, documents, c_tf_idf, topics)
198 updated_topics = {}
199 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
--> 200 truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
201 prompt = self._create_prompt(truncated_docs, topic, topics)
202 self.prompts_.append(prompt)
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_openai.py in <listcomp>(.0)
198 updated_topics = {}
199 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
--> 200 truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
201 prompt = self._create_prompt(truncated_docs, topic, topics)
202 self.prompts_.append(prompt)
/opt/dataiku/code-env/lib/python3.9/site-packages/bertopic/representation/_utils.py in truncate_document(topic_model, doc_length, tokenizer, document)
55 encoded_document = tokenizer.encode(document)
56 truncated_document = tokenizer.decode(encoded_document[:doc_length])
---> 57 return truncated_document
58 return document
59
Oh, and I am using BERTopic version 0.16.0!
Ah, you need to also specify the tokenizer parameter in the representation model. So the following should work:
representation_model = OpenAI(oai_client,
                              prompt=summarization_prompt,
                              model="model-gpt35-16k",
                              delay_in_seconds=2,
                              chat=True,
                              nr_docs=4,
                              doc_length=100,
                              tokenizer="whitespace")
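For context, judging from the traceback, truncate_document in bertopic/representation/_utils.py roughly does the following (a simplified sketch, not the exact source). With doc_length set but tokenizer left as None, none of the branches assigns truncated_document, which is exactly the UnboundLocalError you are seeing:
# Simplified sketch of truncate_document, reconstructed from the traceback
# above; not the exact BERTopic source.
def truncate_document(topic_model, doc_length, tokenizer, document):
    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            # keep the first doc_length whitespace-separated tokens
            truncated_document = " ".join(document.split()[:doc_length])
        elif hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
            encoded_document = tokenizer.encode(document)
            truncated_document = tokenizer.decode(encoded_document[:doc_length])
        # With doc_length set but tokenizer=None, no branch above ran, so
        # truncated_document was never assigned -> UnboundLocalError here:
        return truncated_document
    return document
With tokenizer="whitespace", each document is cut to its first doc_length whitespace-separated tokens before being placed into the prompt.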
And also I forgot to thank you - thank you for this amazing tool and the help :)
That's kind of you, thanks!
custom_stop_words = [ "postcode", "klant", "agent", "beller", "vraag", "vraagt" ]
Nice to see a Dutch use case! Out of curiosity, which embedding model are you using?
Thank you! This fixed everything :)
I am currently using "paraphrase-multilingual-mpnet-base-v2". After a bit of experimentation this one seemed to work most sensibly, but I may want to test a bit more extensively in the near future. Feel free to reach out if you have any more questions about this use case :)
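For reference, a minimal sketch of how embeddings from that model can be precomputed and passed to BERTopic so it skips its own embedding step (the documents here are illustrative, not from the actual dataset):
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Illustrative documents; in the pipeline above these come from the Dataiku dataset
docs = ["De klant vraagt naar de status van een bestelling",
        "De beller wil zijn postcode wijzigen"]

# Compute the embeddings once with the multilingual model mentioned above
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Passing precomputed embeddings means BERTopic will not re-embed the documents
topic_model = BERTopic(verbose=True)
topic_model.fit(docs, embeddings)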