MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Update LangChain Support #2187

Open Skar0 opened 1 month ago

Skar0 commented 1 month ago

Feature request

The provided examples that leverage LangChain to create a representation all make use of langchain.chains.question_answering.load_qa_chain, and the implementation is not very transparent to the user, leading to inconsistencies and making it difficult to understand how to provide custom chains.

Motivation

Some of the issues in detail

Example of workarounds in current implementation

With the current implementation, a user wanting to use a custom LangChain prompt in a custom LCEL chain and add keywords to that prompt would have to do something like the following (ignoring that documents are passed as Document objects and not formatted into a str).

from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate

custom_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Custom instructions."),
        ("human", "Documents: {input_documents}, Keywords: {question}"),
    ]
)

# Some custom LCEL chain built around the prompt above (placeholder)
chain = some_custom_chain_with_above_prompt

representation_model = LangChain(chain, prompt="[KEYWORDS]")
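For illustration, one hypothetical shape for the placeholder chain above, reusing custom_prompt and assuming my_openai_api_key is defined: since the chain's output is expected under an output_text key (as detailed later in this thread), a plain string-producing LCEL chain has to be re-wrapped, e.g. with RunnablePassthrough.assign.

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Hypothetical sketch: produce the label as a string, then expose it under
# the "output_text" key that the current implementation reads from
some_custom_chain_with_above_prompt = RunnablePassthrough.assign(
    output_text=custom_prompt | ChatOpenAI(api_key=my_openai_api_key) | StrOutputParser()
)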

Related issues:

Your contribution

I propose several changes, which I have started working on in a branch (made a PR to make the diff easy to see).

Questions:

MaartenGr commented 1 month ago

Awesome, thank you for the extensive description! I had hoped that LangChain would be stable for a little while longer but unfortunately that does not seem to be the case.

That said, if it's deprecated, we should indeed replace this functionality. Let me address some things here before we continue in the PR:

the approach to add keywords in the prompt (by adding "[KEYWORDS]" in self.prompt and then performing some string manipulation) is confusing.

This behavior is used across all LLM integrations in BERTopic, so if we change it here, it should be changed everywhere. That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.
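For context, a minimal sketch (not BERTopic's exact code) of how such tag-based prompts are resolved through plain string replacement:

# Illustrative only: the [DOCUMENTS] and [KEYWORDS] tags are filled in
# with str.replace before the prompt is sent to the LLM
prompt = "What are these documents about? [DOCUMENTS] Keywords: [KEYWORDS]."
docs = ["a document about cats", "a document about dogs"]
keywords = ["pets", "animals"]
filled_prompt = prompt.replace("[DOCUMENTS]", "\n".join(docs)).replace("[KEYWORDS]", ", ".join(keywords))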

Other than that (and looking at the PR), I'm wondering whether the changes make usage more complex for most users. Take a look at this piece of the documentation you shared:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.chains.combine_documents import create_stuff_documents_chain

chat_model = ChatOpenAI(model=..., api_key=...)

prompt = ChatPromptTemplate.from_template("What are these documents about? {documents}. Please give a single label.")

chain = RunnablePassthrough.assign(
    representation=create_stuff_documents_chain(chat_model, prompt, document_variable_name="documents")
)

That's quite a bit more involved than what it originally was:

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")

Now, while what it originally was needs some changes on the backend (as you nicely shared in this issue), I'm wondering whether we can simplify accessing LangChain within BERTopic a bit more to make it simpler for users. I generally prefer additional representations to take roughly 4 lines of code for a basic LLM and nothing more.

Skar0 commented 1 month ago

Hi,

Thanks for taking the time to reply 😊

That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.

I understand this, and I agree that it is a nice approach to format prompts when using an LLM (e.g. with OpenAI). However, in the case of LangChain, there is already a standard built-in way of formatting prompts using prompt templates.

# Example: prompt with a `topic` placeholder replaced at runtime through the input of the chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

chat_model = ChatOpenAI(model=..., api_key=...)

prompt_template = PromptTemplate.from_template("Tell me a joke about {topic}")

chain = prompt_template | chat_model

chain.invoke({"topic": "cats"})

The current implementation uses a hybrid approach to formatting the prompt, using both LangChain prompt templates and string manipulation. The sequence looks like this (I'll assume that langchain.chains.question_answering.load_qa_chain is used as it's the documented approach).

  1. The prompt (which is hard-coded) contains two placeholders: one for the documents (named context) and one for the prompt provided to the LangChain representation object (named question). It is reproduced below for convenience:

    from langchain_core.prompts.chat import (
        ChatPromptTemplate,
        HumanMessagePromptTemplate,
        SystemMessagePromptTemplate,
    )
    
    system_template = """Use the following pieces of context to answer the user's question. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    ----------------
    {context}"""
    messages = [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template("{question}"),
    ]
    CHAT_PROMPT = ChatPromptTemplate.from_messages(messages)
  2. Even though the placeholder in the prompt is named context (because the document_variable_name set here is context), the chain expects the document objects to be passed through a key named input_documents (as set here). This explains why input_documents and question are used as input keys to the provided chain in the LangChain representation object.
  3. Given the above, the placement of documents into the prompt is thus performed using a LangChain prompt template placeholder (namely context). However, if keywords need to be added to the prompt, they can currently only be provided through the question prompt template placeholder. To do so, one has to provide a prompt to the LangChain representation object, for example

    prompt="[KEYWORDS]"

    in which [KEYWORDS] is replaced by the actual keywords through string manipulation, and that formatted string is then passed as an input to the chain through the question key.

  4. The output of the chain is contained in a key named output_text due to some hard-coding in the object used (a sketch of the full flow follows below).
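Putting steps 1-4 together, a minimal sketch (assuming chain was built with load_qa_chain and docs is a list of strings) of how the representation interacts with the chain:

from langchain_core.documents import Document

# Step 2: documents go in under "input_documents", the formatted prompt under "question"
inputs = {
    "input_documents": [Document(page_content=doc) for doc in docs],
    "question": "keyword_1, keyword_2, keyword_3",  # "[KEYWORDS]" after string replacement
}

# Step 4: the result comes back under the hard-coded "output_text" key
label = chain.invoke(inputs)["output_text"]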

I think these steps illustrate how the complex internal workings of that specific deprecated LangChain approach, together with the combination of LangChain prompt templates and string manipulation, make things very confusing for a user wanting to dig deeper into what is feasible in BERTopic using LangChain. They also make it hard to work with custom chains without reading the source code of the LangChain representation object to understand the expected input and output keys.

I'm wondering whether the changes make usage more complex for most users.

Now, while what it originally was needs some changes on the backend (as you nicely shared in this issue), I'm wondering whether we can simplify accessing LangChain within BERTopic a bit more to make it simpler for users.

To your point, I can modify the approach to make it simpler in general:

  1. Remove the output_text key (which I had renamed representation), which removes the need for RunnablePassthrough to create an output key (which create_stuff_documents_chain doesn't have by default).
  2. Work with a LangChain prompt template, but name the keys so that they are similar to what is used in other representations, which means that the only difference between a LangChain representation prompt and an LLM representation prompt will be the brackets used (curly vs. square).
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
representation_model = LangChain(chain)

becomes

from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI

prompt = ChatPromptTemplate.from_template("What are these documents about? {DOCUMENTS} Here are keywords related to them {KEYWORDS}.")

chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), prompt, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)

Note that we can define a default prompt in the representation, as was done before (but this time as a LangChain prompt template), in which case the code would become

from bertopic.representation import LangChain, DEFAULT_PROMPT
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI

chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), DEFAULT_PROMPT, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)

I made the necessary changes in the PR; let me know what you think! (I'll still need to tinker a bit to actually provide a good default prompt, and to make sure that this allows fancier chains to work, but at least for the basic example it seems to work.)

MaartenGr commented 3 weeks ago

Thanks for taking the time to go through this so thoroughly! I agree with the things that you mention, which kinda makes it difficult for BERTopic, since all LLM-based representations revolve around using [DOCUMENTS] and [KEYWORDS]. I do intend to keep those, as they are something users are familiar with when interacting with different LLMs.

That said, I'm wondering whether we can expose it a bit differently, assuming we always need create_stuff_documents_chain. If that's the case, could we simplify the API for users from this:

from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI

prompt = ChatPromptTemplate.from_template("What are these documents about? {DOCUMENTS} Here are keywords related to them {KEYWORDS}.")

chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), prompt, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)

to this:

from bertopic.representation import LangChain
from langchain.llms import OpenAI

prompt = "What are these documents about? [DOCUMENTS] Here are keywords related to them [KEYWORDS]."
llm = OpenAI(temperature=0, openai_api_key=my_openai_api_key)
representation_model = LangChain(llm, prompt)

where in LangChain, these two components are then connected:

from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

langchain_prompt = prompt.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")
langchain_prompt = ChatPromptTemplate.from_template(langchain_prompt)
chain = create_stuff_documents_chain(llm, langchain_prompt, document_variable_name="DOCUMENTS")

That makes it much easier for most users. If you instead want to use a chain, you can do so with your suggested approach, thereby exposing both the "easy" solution through llm and prompt, and your solution through chain.

I think this might be the best of both worlds but would love to get your view on this.

Skar0 commented 3 weeks ago

(disclaimer: I used ChatGPT to help generate this reply because I didn't have much time 😄)

I agree with the things that you mention, which kinda makes it difficult for BERTopic, since all LLM-based representations revolve around using [DOCUMENTS] and [KEYWORDS]. I do intend to keep those, as they are something users are familiar with when interacting with different LLMs.

I think I understand your point better now. You mean that other LLM representation objects share a similar interface because they take a client plus an optional prompt, and the prompt is formatted using [DOCUMENTS] and [KEYWORDS] placeholders. In that sense, I can understand why you'd rather not do away with the prompt argument in the LangChain representation and prefer to work with the same placeholders.

That said, I'm wondering whether we can expose it a bit differently, assuming we always need create_stuff_documents_chain.

Could you elaborate on what you mean by "always" here? Strictly speaking, you don't have to use create_stuff_documents_chain; it's just a wrapper that performs all the operations needed to take a prompt, format documents into it, run it through an LLM, and output the result. I mention it here because it's probably the cleanest (and least verbose) way to achieve what is needed for a basic version of the LangChain representation, and it is the approach supported by LangChain in their documentation.
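To make that concrete, here is a minimal standalone sketch of what create_stuff_documents_chain does (its default document variable name is context; my_openai_api_key is assumed to be defined):

from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("What are these documents about? {context}")
chain = create_stuff_documents_chain(ChatOpenAI(api_key=my_openai_api_key), prompt)

# The documents are formatted into the prompt, sent to the LLM,
# and the result comes back directly as a plain string
result = chain.invoke({"context": [Document(page_content="BERTopic is a topic modeling technique.")]})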

I think this might be the best of both worlds but would love to get your view on this.

I agree that this seems to be a very good approach to maintain the existing interface while addressing the deprecation issues and simplifying/clarifying the approach for most users as well as people wanting more control over the chain. Let me summarize it like this:

Proposed Solution

  1. Keep the Prompt Argument with Placeholders: Retain the prompt argument in the LangChain representation, using [DOCUMENTS] and [KEYWORDS] as placeholders.

  2. Simplify Chain Creation Internally: Internally handle the creation of the LangChain chain, so users only need to provide the LLM client and optionally a prompt.

  3. Internal Prompt Conversion: Within the LangChain representation, convert the text prompt to a ChatPromptTemplate, replacing [DOCUMENTS] and [KEYWORDS] with {DOCUMENTS} and {KEYWORDS} to align with LangChain's template syntax.

  4. Default Prompt: Provide a default prompt that users can override if they wish.

  5. Make it Easier to Use Custom Chains: To make it easier for users to provide custom chains, we can expose more details about the internal workings:

    • Input Keys: The internally created chain expects input keys named DOCUMENTS and KEYWORDS, matching the placeholders in the prompt. These keys are used to pass the actual documents and keywords to the chain during execution. A custom chain should use the same keys.

    • Output Format: The chain is expected to return the representation directly, not a dictionary where the representation is the value for a specific key. This aligns with the output of create_stuff_documents_chain, which returns a string. We expect custom chains to do the same (see the sketch below).
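For illustration, a hypothetical custom chain that conforms to these expectations, taking input keys DOCUMENTS and KEYWORDS and returning a plain string (assuming here that the documents arrive already formatted into a string):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

custom_prompt = ChatPromptTemplate.from_template(
    "Summarize these documents: {DOCUMENTS}. Keywords: {KEYWORDS}. Reply with a short label."
)

# StrOutputParser makes the chain return the representation directly as a string
custom_chain = custom_prompt | ChatOpenAI(api_key=my_openai_api_key) | StrOutputParser()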

Supporting Both llm, prompt, and chain Arguments

To provide maximum flexibility, we can support both approaches:

Note: Since in LangChain both LLMs and chains are subclasses of Runnable, we could also adjust the LangChain representation to accept a Runnable object through a single argument. If a user provides a chain that is not a BaseLanguageModel or BaseChatModel, we can assume it's a custom chain, and the prompt argument can be ignored. I'll let you choose what you think is best; continuing with the assumption that 3 arguments (chain, llm, prompt) are used, the constructor could look like:

# Imports assumed to be at module level
from langchain_core.language_models import BaseChatModel, BaseLanguageModel
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

if chain is not None:
    self.chain = chain
elif isinstance(llm, (BaseLanguageModel, BaseChatModel)):
    if prompt is None:
        prompt = DEFAULT_PROMPT
    # Convert prompt placeholders to LangChain's template syntax
    langchain_prompt = prompt.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")
    # Create ChatPromptTemplate
    chat_prompt = ChatPromptTemplate.from_template(langchain_prompt)
    # Create chain using create_stuff_documents_chain
    self.chain = create_stuff_documents_chain(llm, chat_prompt, document_variable_name="DOCUMENTS")
else:
    raise ValueError("You must provide either a chain or an llm with a prompt.")
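From the user's side, both entry points would then look roughly like this (llm being any instantiated LangChain model, and custom_chain a conforming chain such as the sketch above):

# Simple path: an LLM plus a tag-based prompt; the chain is built internally
representation_model = LangChain(llm=llm, prompt="What are these documents about? [DOCUMENTS] Keywords: [KEYWORDS].")

# Advanced path: a fully custom chain; the prompt is handled by the chain itself
representation_model = LangChain(chain=custom_chain)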

Handling Chains that Return Multiple Outputs

This is another thing that I would have liked to tackle in this issue. Currently, the implementation assumes that the chain returns a single string representation. However, LangChain allows chains to return lists, which can be useful for generating several labels or aspects of the representation in a single LLM call.

We can enhance the LangChain representation to handle chains that return lists. For example:

# Execute the chain
outputs = self.chain.batch(inputs=inputs, config=self.chain_config)

# Process outputs, which may be strings or lists of strings
labels = []
for output in outputs:
    if isinstance(output, list):
        # Output is a list of labels
        labels.append([label.strip() for label in output])
    else:
        # Output is a single string label
        labels.append(output.strip())
By supporting outputs that are either strings or lists, we enable users to create more sophisticated representations without additional LLM calls (again, this would only change for custom chains; the default behavior with llm + prompt would remain the same).
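For example, a hypothetical custom chain that returns a list of labels by parsing a comma-separated answer:

from langchain_core.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

list_prompt = ChatPromptTemplate.from_template(
    "Give three comma-separated labels for these documents: {DOCUMENTS}. Keywords: {KEYWORDS}."
)

# CommaSeparatedListOutputParser splits the LLM answer into a list of strings
list_chain = list_prompt | ChatOpenAI(api_key=my_openai_api_key) | CommaSeparatedListOutputParser()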

Skar0 commented 3 weeks ago

FYI, I've updated the PR with all the changes discussed above (+ documentation). I'm not too sure about the logic around the output vector with a bunch of empty labels, so I kept it in the implementation (both when the output is a string and when it's a list), but if it's not correct please let me know. I'll probably add a code example of a custom chain that outputs a list if you confirm that it's appropriate (I have done this in the past, but I always concatenated the labels into a single label since only a single output was supported).

MaartenGr commented 2 weeks ago

@Skar0 Awesome, thank you for taking the time! I'll move this over to the PR so we can further discuss the implementation itself.