Quansight / ragna

RAG orchestration framework ⛵️
https://ragna.chat
BSD 3-Clause "New" or "Revised" License

Chat history #421

Open pmeier opened 1 month ago

pmeier commented 1 month ago

Feature description

Currently, each prompt passed to Chat.answer is independent of the chat history. Meaning, although it is presented as one chat, each prompt-answer pair is independent of the others. This can cause confusion: users who reference a previous prompt or answer will get a nonsensical response, because the assistant is not aware of them.

To overcome this, I propose we add a chat history feature that can be toggled on and off. I'm not sure at what level this should apply:

  1. On a global level, i.e. being passed to ragna.Rag() or set in the configuration
  2. On a chat level, i.e. being passed to Rag().chat()
  3. On a message level, i.e. being passed to Chat().answer()

I prefer 2. here, but I'm open to other opinions.

I'm not sure about the interface either. Naively, I would say that instead of passing the prompt to the assistant

https://github.com/Quansight/ragna/blob/84cf4f627ebf52b061681a7a8b106daef3e79a1d/ragna/core/_components.py#L216

we could pass a list of messages here instead. The current case would have only a single item in there, whereas with chat history we would have multiple entries. If we go for this implementation, we also need to decide whether we pass strs or ragna.core.Messages and whether we pass a flat list or a list of prompt-answer-pairs. Let's survey the ecosystem and not reinvent the wheel here.
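A minimal sketch of how that could look on the assistant side, assuming for the moment a flat list[str] (the protocol below is purely illustrative, not the actual ragna.core.Assistant base class):

```python
from typing import Iterator, Protocol

from ragna.core import Source


class Assistant(Protocol):
    # Today the assistant receives roughly answer(prompt: str, sources: list[Source]).
    # Sketch: pass the conversation instead; without chat history the list
    # contains exactly one item, the current prompt.
    def answer(self, messages: list[str], sources: list[Source]) -> Iterator[str]:
        ...
```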

Value and/or benefit

Less user confusion and actual conversations rather than independent Q/A.

Anything else?

To pass historical messages to the assistant, we already have access to the Chat._messages attribute

https://github.com/Quansight/ragna/blob/84cf4f627ebf52b061681a7a8b106daef3e79a1d/ragna/core/_rag.py#L168

However, on the API side we never bothered to fill it when loading a chat from the DB, since so far the chat never accesses the historical messages:

https://github.com/Quansight/ragna/blob/84cf4f627ebf52b061681a7a8b106daef3e79a1d/ragna/deploy/_api/core.py#L236-L240

blakerosenthal commented 1 month ago

I lean towards option 2 as well: history at the chat level, maybe defaulting to True for all chats via the persistent config or at startup via the CLI. Chat histories could be loaded in bulk from the DB and toggled on a per-chat basis via the UI.

blakerosenthal commented 1 month ago

It looks like Langchain creates a special input type that is a combination of prompt + message_history, then builds a history-aware retriever that further rephrases the prompt using the history before sending everything to the LLM.

https://python.langchain.com/v0.2/docs/tutorials/qa_chat_history/#adding-chat-history
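Condensed from the linked tutorial, the pattern looks roughly like this (a sketch only; the surrounding llm and retriever are assumed to already exist):

```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


def build_history_aware_retriever(llm, retriever):
    # Prompt that asks the LLM to rewrite the latest question so that it can
    # be understood without the earlier turns of the conversation.
    contextualize_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "Rephrase the latest user question as a standalone question."),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    # Wraps the plain retriever; retrieval then happens on the rephrased question.
    return create_history_aware_retriever(llm, retriever, contextualize_prompt)
```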

blakerosenthal commented 1 month ago

The current case would have only a single item in there, whereas with chat history we would have multiple entries. If we go for this implementation, we also need to decide whether we pass strs or ragna.core.Messages and whether we pass a flat list or a list of prompt-answer-pairs. Let's survey the ecosystem and not reinvent the wheel here.

To model things similarly to Langchain, we would replace the prompt arg with a more generic input that would take a newly-defined class instance (i.e. LlmInput). This new class would then handle things like rephrasing (a later feature, perhaps), and formatting the chat history alongside the prompt.

pmeier commented 1 month ago

history at the chat level, maybe defaulting to True for all chats via the persistent config or at startup via the CLI.

Don't the start and end of the sentence contradict each other? If we have the history at the chat level, we cannot have it set globally, either through the config file or a CLI option.

When doing it at the chat level, we can have a boolean parameter that gets passed during chat creation, regardless of whether that happens through the Python API, the REST API, or the web UI. Defaulting to True is OK for me.
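For illustration, at the Python API that could look something like the following; the use_history keyword is made up for this example and not an existing parameter:

```python
from ragna import Rag
from ragna.assistants import RagnaDemoAssistant
from ragna.source_storages import RagnaDemoSourceStorage

rag = Rag()

# Hypothetical keyword: enable history for this chat; the default could come
# from the global config, as discussed above.
chat = rag.chat(
    documents=["ragna.txt"],
    source_storage=RagnaDemoSourceStorage,
    assistant=RagnaDemoAssistant,
    use_history=True,
)
```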

This new class would then handle things like rephrasing (a later feature, perhaps), and formatting the chat history alongside the prompt.

I dislike this approach somewhat. Right now we have components that process data objects, e.g. a source storage gets passed documents and returns sources. The data objects don't have any processing logic on them. IMO this makes it a lot easier to keep the mental model of a pipeline. Thus, I'd prefer that functionality like rephrasing becomes a component rather than a method on a data object.

Re formatting: I don't think this can even be on the LlmInput class unless we create a subclass of it for each LLM we have. The format that each LLM accepts varies wildly between vendors. Meaning, we should have an abstract object that the LLM translates into its own dialect. And by "abstract object" I don't necessarily mean a custom object. It might still just be a list[str] or the like.
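For example (the helper name and the alternating-roles assumption are mine), an assistant targeting an OpenAI-style API could translate a flat list[str] history into its own payload:

```python
def to_openai_messages(history: list[str], system_prompt: str) -> list[dict[str, str]]:
    """Translate a neutral, flat history into OpenAI-style chat messages.

    Assumes the history alternates user / assistant turns and ends with the
    current user prompt; other vendors would get their own translation.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for i, content in enumerate(history):
        messages.append(
            {"role": "user" if i % 2 == 0 else "assistant", "content": content}
        )
    return messages
```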

blakerosenthal commented 1 month ago

Don't the start and end of the sentence contradict each other? If we have the history at the chat level, we cannot have it set globally, either through the config file or a CLI option.

Maybe I have to look more closely at how the global config interacts with the system. I meant that the default history behavior is defined globally (i.e. the user can decide whether chat history is on or off for all new chats), but it can be toggled per chat.

I'd prefer that functionality like rephrasing becomes a component rather than a method on a data object.

This makes a lot of sense to me. Keeping the new Input class as a pure data type would still help abstract away the details from the rest of the pipeline, and a rephrasing component can then process these pieces elsewhere.

The format that each LLM accepts varies wildly between vendors. Meaning, we should have an abstract object that the LLM translates into its own dialect.

I'm pretty sure this is how Langchain handles it too.

nenb commented 1 month ago

To model things similarly to Langchain, we would replace the prompt arg with a more generic input that would take a newly-defined class instance (i.e. LlmInput).

I like this, excellent research @blakerosenthal.

Abstracting the prompt like this would also allow us to potentially tackle another (small, but important) problem that has come up - being able to use different system prompts with the same assistant. I am currently doing this by creating multiple assistants, which feels unnecessary.

@pmeier Do we have consensus on adding a new data model LlmInput? If so, what would be the fields that we add on this data model, e.g. user_prompt, system_prompt, chat_history, preprocessed_user_prompt, etc.?
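Purely as a starting point, a data-only version with those fields could look like this (everything beyond the field names is an assumption):

```python
import dataclasses
from typing import Optional

from ragna.core import Message


@dataclasses.dataclass
class LlmInput:
    user_prompt: str
    system_prompt: str
    chat_history: list[Message] = dataclasses.field(default_factory=list)
    # Filled in by a separate rephrasing component, if one is configured.
    preprocessed_user_prompt: Optional[str] = None
```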

I propose we add a chat history feature that can be toggled on and off

Perhaps we don't even need to add this - as long as we make the messages available to the Assistant, a user can decide whether or not to make use of them when interacting with the LLM.

@pmeier The rest of your proposal regarding the implementation seems sensible to me - make use of the _messages attribute (perhaps now including it as part of the more general LlmInput data model) and make sure that the API also supports this.

Excited about this!

dillonroach commented 1 month ago

I'll keep my comments at the higher LLM-interaction level, as I have less of an opinion about how ragna specifically should do it, but with that said: