Skar0 opened 1 month ago
Awesome, thank you for the extensive description! I had hoped that LangChain would be stable for a little while longer but unfortunately that does not seem to be the case.
That said, if it's deprecated we indeed should be replacing this functionality. Let me address some things here before we continue in the PR:
the approach to add keywords in the prompt (by adding "[KEYWORDS]" in self.prompt and then performing some string manipulation) is confusing.
This behavior is used throughout all LLMs integrated in BERTopic, so if we change it here it should be changed everywhere. That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.
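For context, these tags are used in prompts roughly like this (a rough sketch only; the exact default prompts and argument names can differ per representation and version):
from bertopic.representation import OpenAI
import openai

client = openai.OpenAI(api_key="sk-...")
prompt = "I have documents described by these keywords: [KEYWORDS]. The documents are: [DOCUMENTS]. Give a short topic label."
representation_model = OpenAI(client, prompt=prompt)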
Other than that (and looking at the PR), I'm wondering whether the changes make the usability for most users more complex. Take a look at this piece of the documentation you shared:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.chains.combine_documents import create_stuff_documents_chain
chat_model = ChatOpenAI(model=..., api_key=...)
prompt = ChatPromptTemplate.from_template("What are these documents about? {documents}. Please give a single label.")
chain = RunnablePassthrough.assign(representation=create_stuff_documents_chain(chat_model, prompt, document_variable_name="documents"))
That's quite a bit more involved than what it originally was:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
Now that what it originally was needs some changes on the backend (as you nicely shared in this issue), I'm wondering whether we can simplify how LangChain is accessed within BERTopic a bit more to make it simpler for users. I generally prefer additional representations to take about 4 lines of code to do a basic LLM and nothing more.
Hi,
Thanks for taking the time to reply 😊
That said, I'm actually a big fan of using tags like "[KEYWORDS]" and "[DOCUMENTS]" to indicate where in the prompt certain aspects should go. This is part of a nice user experience and I have no intention of changing that at the moment.
I understand this, and I agree that it is a nice approach to format prompts when using an LLM (e.g. with OpenAI). However, in the case of LangChain, there is already a standard built-in way of formatting prompts using prompt templates.
# Example: prompt with a `topic` placeholder replaced at runtime through the input of the chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
chat_model = ChatOpenAI(model=..., api_key=...)
prompt_template = PromptTemplate.from_template("Tell me a joke about {topic}")
chain = prompt_template | chat_model
chain.invoke({"topic": "cats"})
The current implementation uses a hybrid approach to formatting the prompt, using both LangChain prompt templates and string manipulation. The sequence looks like this (I'll assume that langchain.chains.question_answering.load_qa_chain is used, as it's the documented approach).
The prompt (which is hard-coded) contains two placeholders: one for the documents (named context) and one for the prompt provided here to the LangChain representation object (named question). Here it is below for convenience:
from langchain_core.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

system_template = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}"""

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
CHAT_PROMPT = ChatPromptTemplate.from_messages(messages)
The documents are inserted through the context placeholder: because the document_variable_name set here is context, the chain expects the document objects to be passed through a key named input_documents (as set here). This explains why input_documents and question are used as input keys to the provided chain in the LangChain representation object.
Given the above, the placement of documents into the prompt is thus performed using a LangChain prompt template placeholder (namely context). However, in case keywords need to be added to the prompt, they can currently only be provided through the question prompt template placeholder. In order to do so, one has to provide a prompt to the LangChain representation object, for example prompt="[KEYWORDS]", in which [KEYWORDS] is replaced by the actual keywords through string manipulation, and that formatted string is then passed as an input to the chain through the question key.
The output of the chain is read from the output_text key due to some hard-coding in the object used.
I think these steps illustrate how the complex internal workings of that specific deprecated LangChain approach, together with the combination of LangChain prompt templates and string manipulations, make things very confusing for a user wanting to dig deeper into what is feasible in BERTopic using LangChain (and they don't make it easy to work with custom chains without reading the source code of the LangChain representation object to understand the expected input and output keys).
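To make this concrete, here is a rough sketch (simplified, with illustrative variable names, not the actual source) of how the representation ends up driving such a chain:
# Simplified sketch of the current calling convention; names are illustrative
inputs = [
    {"input_documents": docs, "question": formatted_prompt}  # prompt with [KEYWORDS] already substituted
    for docs, formatted_prompt in zip(docs_per_topic, formatted_prompts)
]
outputs = chain.batch(inputs)                            # the deprecated chain returns dictionaries...
labels = [output["output_text"] for output in outputs]  # ...so the label is read from output_text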
I'm wondering whether the changes make the usability for most users more complex.
Now that what it originally was needs some changes on the backend (as you nicely shared in this issue), I'm wondering whether we can simplify how LangChain is accessed within BERTopic a bit more to make it simpler for users.
To your point, I can modify the approach to make it simpler in general: the chain is now expected to return the representation directly, rather than through the output_text key (which I had renamed representation); this removes the need for RunnablePassthrough to create an output key (which create_stuff_documents_chain doesn't have by default). Concretely,
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
representation_model = LangChain(chain)
becomes
from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI
prompt = ChatPromptTemplate.from_template("What are these documents about? {DOCUMENTS} Here are keywords related to them {KEYWORDS}.")
chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), prompt, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)
Note that we can define a default prompt in the representation, like it was done before (but this time as a LangChain prompt template), and the code would become
from bertopic.representation import LangChain, DEFAULT_PROMPT
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI
chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), DEFAULT_PROMPT, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)
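For illustration (this is not something the PR requires), a chain created this way can also be invoked directly; assuming the prompt contains both {DOCUMENTS} and {KEYWORDS} placeholders (like the explicit example above), the call would look roughly like this, with placeholder documents and keywords:
from langchain_core.documents import Document

label = chain.invoke(
    {
        "DOCUMENTS": [
            Document(page_content="A document about cats."),
            Document(page_content="Another document about kittens."),
        ],
        "KEYWORDS": "cats, kittens, pets",
    }
)
# create_stuff_documents_chain returns the LLM output directly as a string
print(label)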
I made the necessary changes in the PR, let me know what you think! (I'll still need to tinker a bit to actually provide a good default prompt, and to make sure that this allows more fancy chains to work, but at least for the basic example it seems to work)
Thanks for taking the time to so thoroughly go through this! I agree with the things that you mention, which kinda makes it difficult for BERTopic since all LLM-based representations revolve around using [DOCUMENTS] and [KEYWORDS], which I do intend to keep as that is something users are familiar with when interacting with different LLMs.
That said, I'm wondering whether we can expose it a bit differently, assuming we always need create_stuff_documents_chain. If that's the case, could we simplify the API for users from this:
from bertopic.representation import LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.llms import OpenAI
prompt = ChatPromptTemplate.from_template("What are these documents about? {DOCUMENTS} Here are keywords related to them {KEYWORDS}.")
chain = create_stuff_documents_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), prompt, document_variable_name="DOCUMENTS")
representation_model = LangChain(chain)
to this:
from bertopic.representation import LangChain
from langchain.llms import OpenAI
prompt = "What are these documents about? [DOCUMENTS] Here are keywords related to them [KEYWORDS]."
llm = OpenAI(temperature=0, openai_api_key=my_openai_api_key)
representation_model = LangChain(llm, prompt)
where in LangChain, these two components are then connected:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

# Convert the BERTopic-style tags to LangChain template variables
langchain_prompt = prompt.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")
langchain_prompt = ChatPromptTemplate.from_template(langchain_prompt)
chain = create_stuff_documents_chain(llm, langchain_prompt, document_variable_name="DOCUMENTS")
That makes it much easier for most users. If you instead want to use a chain, you can do so with your suggested approach, thereby exposing both the "easy" solution through llm and prompt, and your solution through chain.
I think this might be the best of both worlds but would love to get your view on this.
(disclaimer: I used ChatGPT to help generate this reply because I didn't have much time 😄)
I agree with the things that you mention, which kinda makes it difficult for BERTopic since all LLM-based representations revolve around using [DOCUMENTS] and [KEYWORDS], which I do intend to keep as that is something users are familiar with when interacting with different LLMs.
I think I understand your point better now. You mean that other LLM representation objects share a similar interface because they take a client plus an optional prompt, and the prompt is formatted using [DOCUMENTS] and [KEYWORDS] placeholders. In that sense, I can understand why you'd rather not do away with the prompt argument in the LangChain representation and prefer to work with the same placeholders.
That said, I'm wondering whether we can expose it a bit differently, assuming we always need create_stuff_documents_chain.
Could you elaborate on what you mean by "always" here? Strictly speaking, you don't have to use create_stuff_documents_chain; it's just a wrapper that performs all the operations needed to take a prompt, format documents into it, run it through an LLM, and output the result. I mention it here because it's probably the cleanest (and least verbose) way to achieve what is needed for a basic version of the LangChain representation, and it is the approach put forward in LangChain's documentation.
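To illustrate that point, here is a rough hand-rolled equivalent (a simplified sketch only; the real helper also handles document prompts and separators):
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("What are these documents about? {documents}. Please give a single label.")

def stuff_documents(inputs: dict) -> dict:
    # "Stuff" all Document objects into a single string for the {documents} placeholder
    return {**inputs, "documents": "\n\n".join(doc.page_content for doc in inputs["documents"])}

chain = RunnableLambda(stuff_documents) | prompt | ChatOpenAI(model=..., api_key=...) | StrOutputParser()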
I think this might be the best of both worlds but would love to get your view on this.
I agree that this seems to be a very good approach to maintain the existing interface while addressing the deprecation issues and simplifying/clarifying the approach for most users as well as people wanting more control over the chain. Let me summarize it like this:
- Keep the Prompt Argument with Placeholders: Retain the prompt argument in the LangChain representation, using [DOCUMENTS] and [KEYWORDS] as placeholders.
- Simplify Chain Creation Internally: Internally handle the creation of the LangChain chain, so users only need to provide the LLM client and optionally a prompt.
- Internal Prompt Conversion: Within the LangChain representation, convert the text prompt to a ChatPromptTemplate, replacing [DOCUMENTS] and [KEYWORDS] with {DOCUMENTS} and {KEYWORDS} to align with LangChain's template syntax.
- Default Prompt: Provide a default prompt that users can override if they wish.
- Make it Easier to Use Custom Chains: To make it easier for users to provide custom chains, we can expose more details about the internal workings:
  - Input Keys: The internally created chain expects input keys named DOCUMENTS and KEYWORDS, matching the placeholders in the prompt. These keys are used to pass the actual documents and keywords to the chain during execution. A custom-provided chain should use the same keys.
  - Output Format: The chain is expected to directly return the representation, not a dictionary where this representation is the value for a specific key. This aligns with the output of create_stuff_documents_chain, which returns a string. We expect custom chains to do the same.
- Support the llm, prompt, and chain Arguments: To provide maximum flexibility, we can support both approaches:
  - Simplified Interface: Users can provide an llm and a prompt, and the LangChain representation will internally create the chain using create_stuff_documents_chain.
  - Custom Chain: Alternatively, users can provide a custom chain object if they need more control over the chain's behavior.
Note: Since in LangChain both LLMs and chains are subclasses of Runnable, we could also adjust the LangChain representation to accept a Runnable object through a single argument. If a user provides a chain that is not a BaseLanguageModel or BaseChatModel, we can assume it's a custom chain, and the prompt argument can be ignored. I'll let you choose what you think is best; I'll continue with the assumption that 3 arguments (chain, llm, prompt) are used. The constructor could look like:
# Assuming these imports at the top of the module:
from langchain_core.language_models import BaseChatModel, BaseLanguageModel
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

if chain is not None:
    self.chain = chain
elif isinstance(llm, (BaseLanguageModel, BaseChatModel)):
    if prompt is None:
        prompt = DEFAULT_PROMPT
    # Convert prompt placeholders to LangChain's template syntax
    langchain_prompt = prompt.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")
    # Create ChatPromptTemplate
    chat_prompt = ChatPromptTemplate.from_template(langchain_prompt)
    # Create chain using create_stuff_documents_chain
    self.chain = create_stuff_documents_chain(llm, chat_prompt, document_variable_name="DOCUMENTS")
else:
    raise ValueError("You must provide either a chain or an llm with a prompt.")
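For reference, usage under this three-argument design would look roughly like this (the keyword names are the ones proposed above, not an existing API):
from bertopic.representation import LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model=..., api_key=...)

# Simplified interface: the chain is built internally from llm + prompt
representation_model = LangChain(llm=llm, prompt="What are these documents about? [DOCUMENTS] Here are keywords related to them [KEYWORDS].")

# Custom chain: full control over the chain; the prompt argument would then be ignored
# representation_model = LangChain(chain=my_custom_chain)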
This is another thing that I would have liked to tackle in this issue. Currently, the implementation assumes that the chain returns a single string representation. However, LangChain allows chains to return lists, which can be useful for generating several labels or aspects of the representation in a single LLM call.
We can enhance the LangChain representation to handle chains that return lists. For example:
# Execute the chain
outputs = self.chain.batch(inputs=inputs, config=self.chain_config)

# Process outputs
labels = []
for output in outputs:
    if isinstance(output, list):
        # Output is a list of labels
        labels.append([str(label).strip() for label in output])
    else:
        # Output is a single string label
        labels.append(output.strip())
By supporting outputs that are either strings or lists, we enable users to create more sophisticated representations without additional LLM calls (again, this would only change things for custom chains; the default behavior with llm + prompt would remain the same).
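For reference, a hedged sketch of such a list-returning custom chain could look like this (it assumes the DOCUMENTS/KEYWORDS input keys proposed above; this is not code from the PR):
from langchain_core.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "What are these documents about? {DOCUMENTS} "
    "Here are keywords related to them {KEYWORDS}. "
    "Give three short labels, separated by commas."
)

def stuff_documents(inputs: dict) -> dict:
    # Format the Document objects into a single string for the {DOCUMENTS} placeholder
    return {**inputs, "DOCUMENTS": "\n".join(doc.page_content for doc in inputs["DOCUMENTS"])}

# CommaSeparatedListOutputParser turns the raw LLM output into a list of labels
chain = (
    RunnableLambda(stuff_documents)
    | prompt
    | ChatOpenAI(model=..., api_key=...)
    | CommaSeparatedListOutputParser()
)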
FYI, I've updated the PR with all the changes discussed above (+ documentation). I'm not too sure about the logic around the vector output with a bunch of empty labels, so I kept it in the implementation (both when the output is a string and when it's a list), but if it's not correct please let me know. I'll probably add a code example of a custom chain that outputs a list if you validate that it's appropriate (I have done it in the past, but I always concatenated the labels into a single label since only a single output was supported).
@Skar0 Awesome, thank you for taking the time! I'll move this over to the PR so we can further discuss the implementation itself.
Feature request
The provided examples that leverage LangChain to create a representation all make use of langchain.chains.question_answering.load_qa_chain, and the implementation is not very transparent to the user, leading to inconsistencies and difficulties in understanding how to provide custom chains.
Motivation
Some of the issues in detail:
- langchain.chains.question_answering.load_qa_chain is now deprecated and will be removed at some point.
- A prompt can be specified in the constructor of the LangChain class. However, this is not a prompt but rather a custom instruction that is passed to the provided chain through the question key.
- When using langchain.chains.question_answering.load_qa_chain (which is the provided example), this question key is added as part of a larger, hard-coded (and not transparent to a casual user) prompt.
- A custom chain can be provided instead of the langchain.chains.question_answering.load_qa_chain chain to avoid this hard-coded prompt (this is currently not very clearly documented). In addition, if that specific chain is not used, the use of a question key can be confusing.
- The approach to add keywords in the prompt (by adding "[KEYWORDS]" in self.prompt and then performing some string manipulation) is confusing.
Example of workarounds in current implementation
With the current implementation, a user wanting to use a custom LangChain prompt in a custom LCEL chain and add keywords to that prompt would have to do something like the following (ignoring that documents are passed as Document objects and not formatted into a str).
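A sketch of what such a workaround could look like (illustrative only; it assumes the input_documents / question / output_text keys described earlier, and per the caveat above it does not bother formatting the Document objects):
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "What are these documents about? {documents} "
    "Here are keywords related to them: {keywords}. Please give a single label."
)

# Map the keys expected by the current LangChain representation onto the prompt's placeholders
remap_inputs = RunnableLambda(
    lambda inputs: {"documents": inputs["input_documents"], "keywords": inputs["question"]}
)

chain = (
    remap_inputs
    | prompt
    | ChatOpenAI(model=..., api_key=...)
    | StrOutputParser()
    # The representation reads the label from the output_text key
    | RunnableLambda(lambda text: {"output_text": text})
)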
Related issues:
Your contribution
I propose several changes, which I have started working on in a branch (made a PR to make the diff easy to see).
- langchain.chains.question_answering.load_qa_chain is replaced by langchain.chains.combine_documents.stuff.create_stuff_documents_chain, as recommended in the migration guide.
- The prompt argument is removed from LangChain, as the prompt must now be explicitly created with the chain object (instead of being hidden inside langchain.chains.question_answering.load_qa_chain).
- The keys used by the chain are now documents, keywords, and representation (note that langchain.chains.combine_documents.stuff.create_stuff_documents_chain does not have an output_text output key and the representation key must thus be added).
- The keywords key is always provided to the chain (but it's up to the user to include a placeholder for it in their prompt).
Questions:
- Should a default prompt be provided through DEFAULT_PROMPT? Such a prompt could, however, only be used directly in langchain.chains.combine_documents.stuff.create_stuff_documents_chain, which takes care of formatting the documents.