zilch42 opened this issue 1 year ago
Thanks for the extensive description! I'll make sure to change it in #1539
It would also be helpful to have some way to generate an example prompt with `[DOCUMENTS]` and `[KEYWORDS]` applied, to help with testing, so the user can actually see what's being sent. I've got a custom class because I'm using ChatGPT on AWS, so I've got extra loggers in there, but it's difficult to actually see the prompt in context with standard BERTopic.
That indeed would be helpful. I can enable verbosity to print out the prompts that are given for each call but that might prove to be too much logging if you have a very large dataset.
What I've set up in my custom class is just for it to print the prompt for topic 0 (or the outlier topic if there are no topics), so that might be a good way to go if you want to do it with verbosity rather than making a function to generate just a prompt.
The thing is, when just one topic is logged, users might want to log every one of them, and vice versa. I might add it to the LLMs themselves as additional verbosity levels, but that feels a bit less intuitive with respect to user experience since verbosity is handled differently throughout BERTopic.
Yeah, it might be nice to have access to all the prompts in an easier way than extracting them from logs. Is the selection and diversification of representative documents deterministic? If so, rather than looping through the topics, generating a prompt, and getting the description back one by one, you could generate all of the prompts at once and then loop through the prompts to get each representation. You could then abstract the prompt generation into a function or method exposed to the user, so they could just call a function to get all the prompts using the same arguments that they sent to the LLM initially, and maybe bind them to `.get_topic_info()` if they wanted...

Pseudo-code might be something like:
```python
representation_model = OpenAI(delay_in_seconds=5, nr_docs=10, diversity=0.2)
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
topic_info = topic_model.get_topic_info()
topic_info['prompts'] = representation_model.generate_prompts()
```
Then rather than: https://github.com/MaartenGr/BERTopic/blob/817ad86e0c42462dac659f7b4846c6e5f7432449/bertopic/representation/_openai.py#L192-L220
You might have something like:
```python
# Generate prompts
prompts = self.generate_prompts(topic_model, repr_docs_mappings, topics)

# Log an example prompt (topic 0 if it exists, otherwise the outlier topic)
logger.info("Example prompt: \n{}".format(prompts[min(1, len(prompts) - 1)]))

# Generate using OpenAI's Language Model
updated_topics = {}
for topic, prompt in tqdm(zip(topics, prompts), total=len(topics), disable=not topic_model.verbose):
    # Delay between calls to stay under rate limits
    if self.delay_in_seconds:
        time.sleep(self.delay_in_seconds)

    if self.chat:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
        if self.exponential_backoff:
            response = chat_completions_with_backoff(**kwargs)
        else:
            response = openai.ChatCompletion.create(**kwargs)
        label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
    else:
        if self.exponential_backoff:
            response = completions_with_backoff(model=self.model, prompt=prompt, **self.generator_kwargs)
        else:
            response = openai.Completion.create(model=self.model, prompt=prompt, **self.generator_kwargs)
        label = response["choices"][0]["text"].strip()

    updated_topics[topic] = [(label, 1)]

return updated_topics

def generate_prompts(self, topic_model, repr_docs_mappings, topics):
    prompts = []
    for topic, docs in repr_docs_mappings.items():
        truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
        prompts.append(self._create_prompt(truncated_docs, topic, topics))
    return prompts
```
That code is based on #1539 and still needs some work... It works for generating the representations, but `representation_model.generate_prompts()` still doesn't work, as `generate_prompts` is sitting inside `extract_topics` and relies on some things that aren't easily available from the outside... but no use putting more time into it without your feedback first.
Good idea! It is possible to generate the prompts before passing them to the LLM, as they are currently not dependent on previous prompts. This might change in the future, however, so I think I would prefer to simply save the prompts after generating them iteratively. Then, you could save the prompts to the representation model and access them there.
Since the prompts are also dependent on the order of representation models (KeyBERT -> OpenAI), I think `.generate_prompts` would only work if OpenAI were used as a standalone. So that method would not work without running all other representation methods if they exist, which might prove to be computationally too inefficient.
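For example, with a chained setup along these lines (just a sketch, and the exact arguments are illustrative), the keywords that fill the `[KEYWORDS]` tag in the OpenAI prompt come from the KeyBERT step rather than from OpenAI itself:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, OpenAI

# Chained representation models: KeyBERTInspired runs first, and its keywords
# are what the [KEYWORDS] tag in the OpenAI prompt gets filled with.
representation_model = [KeyBERTInspired(), OpenAI(model="gpt-3.5-turbo", chat=True)]
topic_model = BERTopic(representation_model=representation_model)
```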
Also, in your example, you would essentially create the prompts twice: once when running `.fit_transform` and another time when running `.generate_prompts`. Instead, you could save the prompts to `representation_model.OpenAI` whilst creating the representations during `.fit_transform`. You could then access the prompts with something like `representation_model.generated_prompts_`.
Based on that, I would suggest the following. During any LLM representation model, save the prompts in the representation model whilst they are being created, with the option of logging each of them or just the first. This would mean that the prompts are created once during `.fit_transform` and can easily be accessed afterward.
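A rough sketch of what that could look like inside the OpenAI representation model (attribute names like `generated_prompts_` are placeholders here, not a final API):

```python
# Sketch of extract_topics: store each prompt on the representation model as it
# is created, so it can be inspected after .fit_transform without extra LLM calls.
self.generated_prompts_ = []

updated_topics = {}
for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
    truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
    prompt = self._create_prompt(truncated_docs, topic, topics)
    self.generated_prompts_.append(prompt)

    # Optionally log every prompt, or only the first one
    if topic_model.verbose and len(self.generated_prompts_) == 1:
        logger.info("Example prompt: \n{}".format(prompt))

    # ... call the LLM with `prompt` and fill updated_topics as before ...
```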
Yes, very good points. I forget that data can be saved in objects in Python (I think I still approach Python with a bit of an R mindset). That sounds like a great solution.
Hi Maarten,
I think there is a bug in the OpenAI representation model in the way the prompt is generated. The keywords are only separated by a space, not a comma, which is problematic for n-grams > 1. https://github.com/MaartenGr/BERTopic/blob/244215afebbd982f2d54678f5104af174a72688a/bertopic/representation/_openai.py#L203-L209
Without proper delimiting I end up with a prompt like this:
TextGeneration and Cohere look to be okay. https://github.com/MaartenGr/BERTopic/blob/244215afebbd982f2d54678f5104af174a72688a/bertopic/representation/_textgeneration.py#L130-L136
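For illustration, the difference comes down to how the keyword list is joined before it is substituted into the prompt (a toy example, not the actual source):

```python
keywords = ["machine learning", "neural networks", "model training"]

# Joining on a space blurs keyword boundaries as soon as n-grams > 1 are involved:
print(" ".join(keywords))
# machine learning neural networks model training

# Joining on ", " keeps the keywords distinguishable, as TextGeneration and Cohere do:
print(", ".join(keywords))
# machine learning, neural networks, model training
```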