deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

feat: Tokenizer Aware Prompt Builder #6593

Open sjrl opened 8 months ago

sjrl commented 8 months ago

**Is your feature request related to a problem? Please describe.**
For RAG QA we often want to fully utilize the model's context window by inserting as many retrieved documents as possible. However, it is not easy for a user to know ahead of time how many documents they can pass to the LLM without overflowing the context window. Currently this can only be worked out by trial and error, and often there is no single "correct" top_k at all: the documents in a database can vary greatly in length, so some queries might cause an overflow and others might not, depending on which documents are retrieved.

**Describe the solution you'd like**
Therefore, we would like a type of Prompt Builder that can truncate some of the variables inserted into the prompt (e.g. truncate the documents but none of the instructions). This basically amounts to calculating a dynamic top_k based on the token count of the retrieved documents. To perform this truncation, the Prompt Builder would need to be tokenizer aware.

This would allow users to set a relatively large top_k, confident that the least relevant documents get dropped if they would cause the context window to be exceeded. It would also provide a more consistent search experience, since we would no longer run the risk of the instructions, which often come after the inserted documents in the prompt, being cut off.
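As a rough sketch of the idea (plain Python rather than the actual Haystack API; the helper name, the tiktoken encoding, and the budget numbers are only illustrative assumptions):

```python
import tiktoken

# Any tokenizer works; cl100k_base is just a stand-in for the target model's tokenizer.
enc = tiktoken.get_encoding("cl100k_base")

def fit_documents_to_budget(documents, instructions, question, max_prompt_tokens):
    """Keep as many retrieved documents as fit alongside the fixed prompt parts.

    `documents` is assumed to be ordered by relevance, so the least relevant
    ones are the first to be dropped when the budget runs out (a dynamic top_k).
    """
    def count(text):
        return len(enc.encode(text))

    budget = max_prompt_tokens - count(instructions) - count(question)

    kept = []
    for doc in documents:
        cost = count(doc)
        if cost > budget:
            break  # this document and everything less relevant would overflow the window
        kept.append(doc)
        budget -= cost
    return kept
```

A tokenizer-aware Prompt Builder could apply the same logic internally, using the rendered template (minus the documents) as the fixed part of the budget.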

mathislucka commented 8 months ago

Can we reformulate the issue to something like:

"Provide tokenization options to limit document or text length in pipelines"

I could see multiple places where this applies and multiple strategies too.

For documents in a prompt we could:

~But it's not only prompts, the same would apply to a ranker too.~

sjrl commented 8 months ago

> But it's not only prompts, the same would apply to a ranker too.

Just out of curiosity, what scenario do you have in mind where this would be relevant for the Ranker too?

mathislucka commented 8 months ago

I was actually too fast with that :D

The Ranker only gets one document at a time.

medsriha commented 3 months ago

I'm going to give this a shot. I'll draft something under `PromptTokenAwareBuilder` and report back for feedback.

CarlosFerLo commented 2 months ago

Wouldn't it be nicer to just add a new component that crops the context to the number of tokens needed, depending on how you want to do it: dropping whole docs or just a piece of them? This way we don't need to change all the components one by one to adopt this; we just add this component to the pipeline.

mathislucka commented 2 months ago

Yes, came around on that too.

My thoughts were something like:

`DocumentsTokenTruncater` (probably not a great name) that accepts a list of documents and can truncate them according to different strategies (e.g. truncate left, right, or each).

A `TextTokenTruncater` could be added for other use cases.

The only problem with that approach would be document metadata that you want to use in the prompt.
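As a minimal sketch of what such a standalone truncater could do (the function name, the strategies, and the tiktoken encoding are illustrative assumptions; note it only handles document content, not metadata):

```python
from typing import List, Literal

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in for whichever tokenizer the model uses

def truncate_documents(
    documents: List[str],
    max_tokens: int,
    strategy: Literal["right", "left", "each"] = "right",
) -> List[str]:
    """Hypothetical document truncater logic as a plain function."""
    if strategy == "each":
        # Give every document an equal share of the budget and cut its tail.
        per_doc = max_tokens // max(len(documents), 1)
        return [enc.decode(enc.encode(doc)[:per_doc]) for doc in documents]

    # "right" drops documents from the end of the list, "left" from the start.
    ordered = documents if strategy == "right" else list(reversed(documents))
    kept, used = [], 0
    for doc in ordered:
        cost = len(enc.encode(doc))
        if used + cost > max_tokens:
            break
        kept.append(doc)
        used += cost
    return kept if strategy == "right" else list(reversed(kept))
```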

Generally, I feel like this is less important now since the context length of most models has increased so much.

sjrl commented 2 months ago

> This way we don't need to change all the components one by one to adopt this; we just add this component to the pipeline.

One other (small) issue I foresee when using a separate component is that it's not easy to know how much you should truncate the documents by. Knowing precisely how many tokens the documents should be truncated to requires knowing how many tokens the rest of the prompt in the `PromptBuilder` uses, and that's not easily possible unless the functionality is added to the `PromptBuilder` itself.

> Generally, I feel like this is less important now since the context length of most models has increased so much.

And yeah, I agree with this; it has become less urgent since context lengths are so large nowadays.

mathislucka commented 2 months ago

> One other (small) issue I foresee when using a separate component is that it's not easy to know how much you should truncate the documents by. Knowing precisely how many tokens the documents should be truncated to requires knowing how many tokens the rest of the prompt in the `PromptBuilder` uses, and that's not easily possible unless the functionality is added to the `PromptBuilder` itself.

People could either estimate or count the tokens in their prompt template and then use that to configure the truncater. Not perfect, but it would work.
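For example, rendering the template without any documents and counting what is left gives the fixed token cost of the prompt (the Jinja2 template, the tiktoken encoding, and the numbers below are only illustrative):

```python
from jinja2 import Template

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

template = Template(
    "Answer the question using only the documents below.\n"
    "{% for doc in documents %}{{ doc }}\n{% endfor %}"
    "Question: {{ question }}\nAnswer:"
)

# Render with the documents left out to measure the fixed part of the prompt.
fixed = len(enc.encode(template.render(documents=[], question="placeholder question")))

context_window = 8192   # assumed model context length
answer_reserve = 512    # room left for the generated answer
document_budget = context_window - fixed - answer_reserve  # pass this to the truncater
```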

CarlosFerLo commented 2 months ago

> > One other (small) issue I foresee when using a separate component is that it's not easy to know how much you should truncate the documents by. Knowing precisely how many tokens the documents should be truncated to requires knowing how many tokens the rest of the prompt in the `PromptBuilder` uses, and that's not easily possible unless the functionality is added to the `PromptBuilder` itself.
>
> People could either estimate or count the tokens in their prompt template and then use that to configure the truncater. Not perfect, but it would work.

We could just add a count-tokens method to the prompt template that accepts a tokenizer and returns the number of tokens in the prompt after stripping out all the Jinja syntax.
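A sketch of such a helper, assuming a simple regex strip of the Jinja syntax (the function name and the tiktoken encoding are illustrative, not an existing API):

```python
import re

import tiktoken

def count_template_tokens(template_str: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens of the static text in a Jinja template.

    Strips {{ ... }} expressions and {% ... %} statements so only the fixed
    prose (instructions, labels, etc.) is counted.
    """
    enc = tiktoken.get_encoding(encoding_name)
    static_text = re.sub(r"{{.*?}}|{%.*?%}", "", template_str, flags=re.DOTALL)
    return len(enc.encode(static_text))

# Example: only the static text of this template eats into the context window.
print(count_template_tokens(
    "Answer using the documents.\n"
    "{% for doc in documents %}{{ doc }}\n{% endfor %}"
    "Question: {{ question }}\nAnswer:"
))
```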