MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

n-gram Keywords need delimiting in OpenAI() #1546

Open · zilch42 opened 1 year ago

zilch42 commented 1 year ago

Hi Maarten,

I think there is a bug in the OpenAI representation model in the way the prompt is generated. The keywords are only separated by a space, not a comma, which is problematic for n-grams > 1. https://github.com/MaartenGr/BERTopic/blob/244215afebbd982f2d54678f5104af174a72688a/bertopic/representation/_openai.py#L203-L209

Without proper delimiting I end up with a prompt like this:

I have a topic that contains the following documents: 
- Legumes for mitigation of climate change and the provision of feedstock for biofuels and biorefineries. A review.
- A global spectral library to characterize the world's soil.
- Classification of natural flow regimes in Australia to support environmental flow management.
- Laboratory characterisation of shale properties.
- Effects of climate extremes on the terrestrial carbon cycle: concepts, processes and potential future impacts.
- Threat of plastic pollution to seabirds is global, pervasive, and increasing.
- Pushing the limits in marine species distribution modelling: lessons from the land present challenges and opportunities.
- Land-use futures in the shared socio-economic pathways.
- The WULCA consensus characterization model for water scarcity footprints: assessing impacts of water consumption based on available water remaining (AWARE).
- BIOCHAR APPLICATION TO SOIL: AGRONOMIC AND ENVIRONMENTAL BENEFITS AND UNINTENDED CONSEQUENCES.

The topic is described by the following keywords: food land use global properties climate using review potential change different production environmental data changes high study based years model models time used area future terrestrial plant field analysis management

Based on the information above, extract a short topic label in the following format:
topic: <topic label>

TextGeneration and Cohere look to be okay. https://github.com/MaartenGr/BERTopic/blob/244215afebbd982f2d54678f5104af174a72688a/bertopic/representation/_textgeneration.py#L130-L136
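For what it's worth, the sort of delimiting I'd expect is just a comma-separated join, something like this (sketch only; keywords_as_string is an illustrative helper, not part of BERTopic):

    def keywords_as_string(topic_model, topic):
        """Illustrative sketch only -- not the actual BERTopic code.

        Joining keywords with ", " keeps n-grams such as "land use"
        distinguishable from the neighbouring keywords in the prompt.
        """
        keywords = [word for word, _ in topic_model.get_topic(topic)]
        return ", ".join(keywords)

    # e.g. prompt = prompt.replace("[KEYWORDS]", keywords_as_string(topic_model, topic))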

It would also be helpful to have some way to generate an example prompt with [DOCUMENTS] and [KEYWORDS] applied to help with testing, so the user can actually see what's being sent. I've got a custom class because I'm using ChatGPT on AWS, so I've added extra loggers in there, but it's difficult to actually see the prompt in context with standard BERTopic.

MaartenGr commented 1 year ago

Thanks for the extensive description! I'll make sure to change it in #1539

> It would also be helpful to have some way to generate an example prompt with [DOCUMENTS] and [KEYWORDS] applied to help with testing, so the user can actually see what's being sent. I've got a custom class because I'm using ChatGPT on AWS, so I've added extra loggers in there, but it's difficult to actually see the prompt in context with standard BERTopic.

That indeed would be helpful. I can enable verbosity to print out the prompts that are given for each call but that might prove to be too much logging if you have a very large dataset.

zilch42 commented 1 year ago

What I've set up in my custom class just prints the prompt for topic 0 (or the outlier topic if there are no other topics), so that might be a good way to go if you want to do it with verbosity rather than make a function to generate just a prompt.
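Roughly, the logging looks like this (sketch only; prompts_by_topic and log_example_prompt are names from my own wrapper, not BERTopic):

    import logging

    logger = logging.getLogger(__name__)

    # Sketch of my custom logging -- not part of BERTopic.
    # `prompts_by_topic` is a {topic_id: prompt} dict built inside my wrapper class.
    def log_example_prompt(prompts_by_topic):
        # Prefer topic 0; fall back to the outlier topic (-1) if there are no other topics
        example_topic = 0 if 0 in prompts_by_topic else -1
        logger.info("Example prompt for topic %s:\n%s", example_topic, prompts_by_topic[example_topic])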

MaartenGr commented 1 year ago

The thing is, when just one topic is logged, users might want to log all of them, and vice versa. I might add additional verbosity levels in the LLMs themselves, but that feels a bit less intuitive with respect to user experience, since verbosity is handled differently throughout BERTopic.

zilch42 commented 1 year ago

Yeah, it might be nice to have access to all the prompts in an easier way than extracting them from logs. Is the selection and diversification of representative documents deterministic? If so, rather than looping through the topics and generating a prompt and getting the description back one by one, you could generate all of the prompts at once, then loop through the prompts to get each representation. The prompt generation could then be abstracted into a function or method exposed to the user, so they could call it to get all the prompts using the same arguments they originally sent to the LLM, and maybe bind them to .get_topic_info() if they wanted...

Pseudocode might be something like:

from bertopic import BERTopic
from bertopic.representation import OpenAI

representation_model = OpenAI(delay_in_seconds=5, nr_docs=10, diversity=0.2)

topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)

topic_info = topic_model.get_topic_info()
topic_info['prompts'] = representation_model.generate_prompts()

Then, rather than: https://github.com/MaartenGr/BERTopic/blob/817ad86e0c42462dac659f7b4846c6e5f7432449/bertopic/representation/_openai.py#L192-L220

You might have something like:

        # generate prompts 
        prompts = self.generate_prompts(topic_model, repr_docs_mappings, topics)

        # Log an example prompt (topic 0 if it exists, otherwise the outlier topic)
        logger.info("Example prompt: \n{}".format(prompts[min(1, len(prompts) - 1)]))

        # Generate using OpenAI's Language Model
        updated_topics = {}
        # Iterate over the same topic order used to build the prompts
        for topic, p in tqdm(zip(repr_docs_mappings.keys(), prompts), total=len(prompts), disable=not topic_model.verbose):

            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)

            if self.chat:
                messages = [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": p}
                ]
                kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
                if self.exponential_backoff:
                    response = chat_completions_with_backoff(**kwargs)
                else:
                    response = openai.ChatCompletion.create(**kwargs)
                label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")
            else:
                if self.exponential_backoff:
                    response = completions_with_backoff(model=self.model, prompt=p, **self.generator_kwargs)
                else:
                    response = openai.Completion.create(model=self.model, prompt=p, **self.generator_kwargs)
                label = response["choices"][0]["text"].strip()

            updated_topics[topic] = [(label, 1)]

        return updated_topics

    def generate_prompts(self, topic_model, repr_docs_mappings, topics):
        prompts = []
        for topic, docs in repr_docs_mappings.items():
            truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
            prompts.append(self._create_prompt(truncated_docs, topic, topics))

        return prompts 

That code is based on #1539 and still needs some work... It works for generating the representations, but representation_model.generate_prompts() still doesn't work on its own, since generate_prompts is called from within extract_topics and relies on arguments (like repr_docs_mappings) that aren't easily available from the outside... but there's no use putting more time into it without your feedback first.

MaartenGr commented 1 year ago

Good idea! It is possible to generate the prompts before passing them to the LLM; they are currently not dependent on previous prompts. This might change in the future, however, so I think I would prefer to simply save the prompts after generating them iteratively. Then, you could save the prompts to the representation model and access them there.

Since the prompts are also dependent on the order of representation models (KeyBERT -> OpenAI), I think .generate_prompts would only work if OpenAI were used as a standalone. So that method would not work without running all other representation methods if they exist, which might prove to be computationally too inefficient.

Also, in your example, you would essentially create the prompts twice: once when running .fit_transform and again when running .generate_prompts. Instead, you could save the prompts to the OpenAI representation model whilst creating the representations during .fit_transform. You could then access the prompts with something like representation_model.generated_prompts_.

Based on that, I would suggest the following: for any LLM-based representation model, save the prompts in the representation model whilst they are being created, with the option of logging each of them or just the first. This would mean that the prompts are created once during .fit_transform and can easily be accessed afterward.
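Roughly the idea, sketched here as a user-side subclass just for illustration (in practice it would live inside the LLM representation models themselves, and generated_prompts_ is only a tentative name):

    from bertopic import BERTopic
    from bertopic.representation import OpenAI


    class OpenAIWithStoredPrompts(OpenAI):
        """Sketch only -- `generated_prompts_` is a tentative name, not an existing attribute."""

        def extract_topics(self, topic_model, documents, c_tf_idf, topics):
            # Reset the stored prompts on every call
            self.generated_prompts_ = []
            return super().extract_topics(topic_model, documents, c_tf_idf, topics)

        def _create_prompt(self, docs, topic, topics):
            # Store each prompt as it is created so it can be inspected later
            prompt = super()._create_prompt(docs, topic, topics)
            self.generated_prompts_.append(prompt)
            return prompt


    representation_model = OpenAIWithStoredPrompts(delay_in_seconds=5, nr_docs=10, diversity=0.2)
    topic_model = BERTopic(representation_model=representation_model)
    topics, probs = topic_model.fit_transform(docs)  # docs: your list of documents

    # The prompts were created once during .fit_transform and can be inspected afterwards:
    print(representation_model.generated_prompts_[0])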

zilch42 commented 1 year ago

Yes, very good points. I forget that data can be saved in objects in Python (I think I still approach Python with a bit of an R mindset). That sounds like a great solution.