getnamo / Llama-Unreal

Llama.cpp plugin for Unreal Engine 5
MIT License

Allow changing conversation state via model state #16

Open getnamo opened 2 months ago

getnamo commented 2 months ago

Related: https://github.com/getnamo/Llama-Unreal/issues/5

References:

Check that the model can recall information that's in the prompt history during a chat session (not post-reset).

Current suspicion is that the chat history is not synced internally with the model.

https://github.com/getnamo/Llama-Unreal/blob/6da6d73dcc8fb984438f5cf0e7774a56d07d3c52/Source/LlamaCore/Public/LlamaComponent.h#L113 to https://github.com/getnamo/Llama-Unreal/blob/6da6d73dcc8fb984438f5cf0e7774a56d07d3c52/Source/LlamaCore/Private/LlamaComponent.cpp#L127?

At https://github.com/getnamo/Llama-Unreal/blob/6da6d73dcc8fb984438f5cf0e7774a56d07d3c52/Source/LlamaCore/Private/LlamaComponent.cpp#L388 for input

jawadato commented 2 months ago

Chat history is indeed synced during the session.

Only the parts of the conversation that happen during the session, pre-reset, are maintained, however. Setting ModelState.PromptHistory to a template-wrapped string prior to activating the component seems to have no effect: the model discards prior conversations after a reset, and only tokens newly appended during the active session are remembered.

If this is the expected behavior then there are no bugs.

However, the ability to save and load the conversational history would be really powerful. Currently, the way I am doing it is by storing the value of ModelState.PromptHistory in a string variable that is retained between resets. Prior to activating the component, the initial prompt (ModelParams.Prompt) is set to this value, which allows the model to load that information into session memory.
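For reference, a minimal sketch of that workaround, assuming the component class from LlamaComponent.h is ULlamaComponent and that ModelState/ModelParams are accessible as shown; the surrounding function names are purely illustrative:

```cpp
// Sketch of the save/restore workaround described above (illustrative only).
// ModelState.PromptHistory and ModelParams.Prompt come from the discussion;
// SavedHistory is our own variable that survives between resets.

FString SavedHistory;

void SaveConversation(ULlamaComponent* Llama)
{
    // Capture the template-wrapped history accumulated during the session.
    SavedHistory = Llama->ModelState.PromptHistory;
}

void RestoreConversation(ULlamaComponent* Llama)
{
    // Feed the saved history back in as the initial prompt
    // before (re)activating the component.
    Llama->ModelParams.Prompt = SavedHistory;
    Llama->Activate();
}
```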

A better way of doing this would be to take the template-wrapped history string, append the new prompt to it, and send the whole string for generation, doing so on every prompt. This causes the prompt string to get longer and longer as the session progresses, so in this scheme the context needs to be reset with each prompt to keep the context length from growing unnecessarily. Because the whole message history is sent, the model remains coherent and can recall information from prior conversations even across resets.
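Roughly, the per-prompt flow I have in mind looks like this; ResetContext() and GeneratePrompt() are hypothetical stand-ins for whatever the plugin would expose, not existing functions:

```cpp
// Sketch of the "resend the whole history each turn" pattern (hypothetical API).

FString FullHistory; // template-wrapped conversation so far

void SendUserTurn(ULlamaComponent* Llama, const FString& WrappedUserTurn)
{
    // Append the new, already template-wrapped user turn to the running history.
    FullHistory += WrappedUserTurn;

    // Reset the context so it never holds more than the resent prompt,
    // then evaluate the entire history from scratch.
    Llama->ResetContext();
    Llama->GeneratePrompt(FullHistory);
}

// When the reply arrives, append it (template-wrapped) to FullHistory as well,
// so the next turn carries the complete conversation.
```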

From my understanding, this is how the OpenAI API does it, and the approach has been adopted by some of the llama.cpp implementations that I have used so far; take a look here:

https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#chat-completions

In the case of the webui, the whole conversational history is sent with each API request, and it generates a new context each time, if I recall correctly.

The advantage of this approach is that it allows the prompt messages to be edited at runtime to guide the conversation in a certain direction, preventing the model from hallucinating and maintaining a certain writing style by replacing choice words.

In my current solution, the conversation messages can only be modified once, prior to activating the component, rather than before sending each new prompt.

I think the ability to easily reset the context, without unloading the model or resetting the static parameters, would enable a better way to retain prompt history: reset the context, then send the prompt history string with the new prompt appended to it.

getnamo commented 2 months ago

There's a performance implication in recalculating the whole prompt each time (it's compute bound: regenerating the KV cache). That said, history storage and reply regeneration are absolutely things the API should support.

This might need a naive implementation to start: if the history changes, run a full recalculation; otherwise, continue with the current KV cache.
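A rough sketch of what that naive check could look like on the inference side; the helpers and the cached-history comparison are assumptions, not existing plugin code:

```cpp
// Naive cache-invalidation sketch (hypothetical internals). If the externally
// supplied history differs from what the KV cache was built from, recompute
// everything; otherwise keep decoding against the existing cache.

FString CachedHistory; // history the current KV cache was built from

void SyncHistory(const FString& RequestedHistory)
{
    if (RequestedHistory != CachedHistory)
    {
        // History was edited or loaded from a save: drop the KV cache and
        // re-evaluate the full prompt. This is the compute-bound slow path.
        ClearKVCache();                        // e.g. llama_kv_cache_clear() internally
        TokenizeAndEvaluate(RequestedHistory); // hypothetical helper
        CachedHistory = RequestedHistory;
    }
    // else: history only grew by tokens we generated ourselves,
    // so the existing KV cache is still valid and we continue from it.
}
```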

getnamo commented 2 months ago

I think in the original API approach I was largely seeing FLLMModelState as a pure, game-thread-accessible projection of the internal model state, and I'm not sure that modifying it wouldn't convolute the expected access patterns. In contrast, FLLMModelParams are the settings you push into the model before inference. That said, it might make more sense to just make FLLMModelState modifiable, but it should probably go through a function call, e.g. ModifyModelState() or something of the like.
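One possible shape for that entry point, as a sketch only (the class name, dirty flag, and signature are assumptions):

```cpp
// Hypothetical ModifyModelState(): callers mutate the game-thread copy inside
// a lambda, and the component decides afterwards whether the change requires
// a full prompt re-evaluation.

void ULlamaComponent::ModifyModelState(TFunctionRef<void(FLLMModelState&)> Mutator)
{
    const FString OldHistory = ModelState.PromptHistory;

    Mutator(ModelState); // caller edits history etc. on the game thread

    if (ModelState.PromptHistory != OldHistory)
    {
        // History was rewritten externally; mark it dirty so the inference
        // side rebuilds the KV cache before the next generation.
        bHistoryDirty = true;
    }
}
```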