MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Missing function #1652

Open Matagi1996 opened 11 months ago

Matagi1996 commented 11 months ago

The tutorials for LLM topic generation use textgeneration.py or openai; those classes have this function to insert topics and documents into a custom prompt:

def _create_prompt(self, docs, topic, topics):
    keywords = ", ".join(list(zip(*topics[topic]))[0])

    # Use the default prompt and replace keywords
    if self.prompt == DEFAULT_PROMPT:
        prompt = self.prompt.replace("[KEYWORDS]", keywords)

    # Use a prompt that leverages either keywords or documents in
    # a custom location
    else:
        prompt = self.prompt
        if "[KEYWORDS]" in prompt:
            prompt = prompt.replace("[KEYWORDS]", keywords)
        if "[DOCUMENTS]" in prompt:
            to_replace = ""
            for doc in docs:
                to_replace += f"- {doc}\n"
            prompt = prompt.replace("[DOCUMENTS]", to_replace)

    return prompt

It seems this function is missing from the LangChain wrapper, and therefore using a LangChain pipeline will not replace the prompt placeholders with documents/topics.
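For reference, this is roughly how I use those placeholders with the other wrappers, e.g. TextGeneration (a sketch based on my understanding of the documentation; the exact constructor arguments may differ between BERTopic versions):

from transformers import pipeline
from bertopic import BERTopic
from bertopic.representation import TextGeneration

# Custom prompt containing both placeholders that _create_prompt fills in
prompt = (
    "I have a topic described by the keywords: [KEYWORDS].\n"
    "The topic contains the following documents:\n[DOCUMENTS]\n"
    "Give a short label for this topic."
)

# Any text-generation pipeline should work here; flan-t5 is just an example
generator = pipeline("text2text-generation", model="google/flan-t5-base")
representation_model = TextGeneration(generator, prompt=prompt)

topic_model = BERTopic(representation_model=representation_model)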

I will write my own wrapper for now; I just wanted confirmation that this is the reason topics were not inserted into my prompt, or whether I am missing something crucial here compared to the other wrappers.

MaartenGr commented 11 months ago

LangChain works a bit differently from these other methods. As you can see in the source code here, the prompts do not use the [DOCUMENTS] tag; instead, the representative documents are given to LangChain directly:

https://github.com/MaartenGr/BERTopic/blob/7d07e1e94e69be278f79a48d73602cdc4df0885f/bertopic/representation/_langchain.py#L171-L191

That does indeed mean that the documentation should be updated to properly describe this behavior.
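In other words, with the LangChain representation the prompt is only the instruction; the representative documents are handed to the chain separately. Roughly like this (a sketch following the documentation at the time of writing; exact imports and parameters may differ between LangChain and BERTopic versions):

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from bertopic.representation import LangChain

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

# Note: no [DOCUMENTS] tag is needed; BERTopic passes the
# representative documents to the chain directly.
representation_model = LangChain(chain, prompt="What are these documents about? Give a short label.")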

matteomarjanovic commented 9 months ago

Does this mean that, currently, the LangChain representation model doesn't give the option to put keywords in the prompt?

MaartenGr commented 9 months ago

That is correct. It should be straightforward to implement yourself, considering other representation models do have that option.
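Until then, a manual workaround is to build the prompts yourself after fitting and assign the generated labels with set_topic_labels. A rough sketch using only public BERTopic methods (the LLM call itself is left as a placeholder, since the LangChain API changes frequently):

from bertopic import BERTopic

# docs is assumed to be your list of input documents
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

prompt_template = (
    "I have a topic described by the keywords: [KEYWORDS].\n"
    "It contains these documents:\n[DOCUMENTS]\n"
    "Give a short label for this topic."
)

labels = {}
for topic in set(topics):
    if topic == -1:  # skip the outlier topic
        continue
    keywords = ", ".join(word for word, _ in topic_model.get_topic(topic))
    documents = "\n".join(f"- {doc}" for doc in topic_model.get_representative_docs(topic))
    prompt = prompt_template.replace("[KEYWORDS]", keywords).replace("[DOCUMENTS]", documents)
    # labels[topic] = your_langchain_chain.run(prompt)  # hypothetical call; depends on your LangChain setup
    labels[topic] = prompt[:50]  # placeholder so the sketch runs without an LLM

topic_model.set_topic_labels(labels)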