SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

Question: prompting and InteractiveModeExecute.cs #156

Closed mphacker closed 11 months ago

mphacker commented 11 months ago

I am trying to understand the best way to set up prompts and the library for an interactive chat session. It looks like, based on the InteractiveModeExecute.cs example, the "bob" personality is only described once and the interactive session then runs from that. Wouldn't the concept of bob be lost over time as the conversation goes on?

I am working on a scenario where I have a multi user bot with a defined system prompt that describes the bot's behavior and establishes guidelines on how to respond.

When a person sends a message to the bot, I am using Semantic Kernel and Azure OpenAI to handle a vector search to see if I have any relevant information stored that would apply and help the bot properly respond.

Ultimately, I end up building a larger prompt that includes the system prompt, embedding results text, and finally the user's input. I pass all of this to the InferAsync method of the InteractiveExecutor. This was the only way I could find to give the model all of the data necessary to form a response.

I am not using the ChatSession object for this process since the bot may receive messages from different users and I really don't need the bot to respond based on past messages.

What I am seeing is that I get results for the first executor.InferAsync, but as new messages come in, InferAsync just returns empty results. The only way I have found to stop that from happening is to save off a clean state from the executor once that object is created and then load that state prior to each InferAsync call. That doesn't seem right.

I have tried the StatelessExecutor, but I get some really odd results: rambling output that doesn't take much of the full prompt into consideration.

Any thoughts on my approach and what I should do differently? I am considering trying to use the ChatSession object and then saving a session off per chat user. When a user sends in a message, I could check to see if there is a past session that could be loaded and continue the conversation from there.

mphacker commented 11 months ago

Digging into the ChatSession class I see a couple of things that don't make sense to me.

  1. It looks like there is a Chat method that takes a prompt and InferenceParams but doesn't seem to do anything with past history when building the prompt. It just adds the message to a Message collection in a History object. So calling the Chat method does not add any history back into the prompt being sent to the model.
  2. There is a ChatAsync method that does take in a ChatHistory object and generates a prompt from only the history. When would this be used? Why would I infer from only past history and not include the user prompt?
  3. There is a final ChatAsync method which is just an async version of the first method in this list.

I am really having a hard time understanding how / why to use the ChatSession object as it is. To me, we would want to generate prompts that include some past history along with the user prompt. We would also need to ensure that the total prompt size plus the number of tokens we expect in return does not exceed the context size. In a case where it would exceed, we would want to reduce the amount of history being shared with the prompt.

I am sure I am misunderstanding either something with the library or how the Llama model works. Anyone have some guidance or advice?

martindevans commented 11 months ago

I don't know the ChatSession stuff very well at all (I've mostly been doing work at the much "lower levels" of the library), but I think I can answer some of these questions.

> So calling the Chat method does not add any history back into the prompt being sent to the model.

If you look into the executors (e.g. LLamaInteractExecutor, line 165) you'll find something like this:

```csharp
// Evaluate the new tokens, continuing on from the tokens already in the context.
_pastTokensCount = Context.Eval(_embeds, _pastTokensCount);
```

When the model is evaluated you pass in embeds (i.e. your prompt) and _pastTokensCount (i.e. how many tokens from the past prompts to use).

So the history is stored in the context itself. I think this fact is really important to understand as it probably changes the rest of your questions somewhat.
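
In practice that means the history accumulates inside the executor's context for you. Very roughly, something like this sketch (the model path, prompts and parameter values are just placeholders, and I'm assuming the LLamaWeights / LLamaContext setup from the current examples):

```csharp
using System;
using LLama;
using LLama.Common;

// Placeholder model path and prompts; parameters kept minimal for the sketch.
var parameters = new ModelParams("path/to/model.gguf") { ContextSize = 2048 };
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
var inferenceParams = new InferenceParams { AntiPrompts = new[] { "User:" } };

// First call: system/persona text plus the first user message.
await foreach (var piece in executor.InferAsync("Bob is a helpful assistant.\nUser: Hello!\nBob:", inferenceParams))
    Console.Write(piece);

// Later calls: only the new text is passed in. Everything evaluated so far is
// already cached in the context (tracked via _pastTokensCount), so it is not re-sent.
await foreach (var piece in executor.InferAsync("\nUser: What did I just say?\nBob:", inferenceParams))
    Console.Write(piece);
```

The second call only needs the new text because the executor is stateful; a StatelessExecutor would have to re-evaluate everything each time.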

> There is a ChatAsync method that does take in a ChatHistory object and generates a prompt from only the history...

Yeah this looks very weird to me. It'll transform the entire history into text and run inference on it, but as discussed above most of that history should already be in the context.

The only use I can see for this is if you've just created a new ChatSession - you could pass in a ChatHistory to initialise it. Even if that is the intended use case it seems like a bad design to me - it's too easy to misuse by passing in a ChatHistory after initialisation and polluting your context with a whole load of repeated tokens.

> There is a final ChatAsync method which is just an async version of the first method in this list.

This is a pattern that's used a few times throughout the library - e.g. executors have Infer and InferAsync which do the same thing, ITextStreamTransform has Transform and TransformAsync which do the same thing.

I'm hoping to clean this up at some point and have just one. I haven't come up with a good way to do that yet though. Inference itself isn't async, but other things that happen during inference are (e.g. if state gets read/written during inference we want to do that asynchronously), and only having an async API isn't very accessible either.

> To me, we would want to generate prompts that include some past history along with the user prompt.

It sounds to me like you may want to work more at the level of an executor? The internals of the executors manage converting text->tokens, passing those tokens to the model and automatically handling when there are too many tokens (see HandleRunOutOfContext in LLamaExecutorBase). Right now this isn't very accessible, and it's a long term goal of mine to improve the executors (make them easier to extend and easier to use).

A Final Note

Another option is to look at the SemanticKernel based LLamaSharpChatCompletion which was added recently. I believe that serves a similar purpose.

mphacker commented 11 months ago

Thank you for your response. I am still very new to a lot of this, but trying to learn.

So for my scenario, I am trying to have a shared chat bot that has a basic personality and uses a vector database to look up related content to augment the prompt. The bot sits in a chat room and when a person tags the bot, it will respond to their message. This means that multiple users can chat with the bot via the chat room and its responses are seen in the chat room by everyone.

So here is where I start making things very complex in my head. The sum of tokens for the prompt and the response needs to be less than the token context size of the model, correct? So that means I have to be very intentional in how I generate a prompt. My prompt needs to include a system message of sorts that defines the personality and expected results from the model. It is basically the instructions on how it should respond. I may include some facts that are coming from an external lookup, like a vector database to support the user's question/request. I will possibly want to include some past history of the conversation so the bot seems to build off of prior messages. Finally, I need to have the user's input.
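
To make that concrete, here is roughly the kind of assembly I have in mind (just a sketch; the helper and its names are made up, and the token counting simply uses Context.Tokenize):

```csharp
using System.Collections.Generic;
using System.Linq;
using LLama;

static class PromptBuilder
{
    // Hypothetical helper: drops the oldest history entries until the system message,
    // retrieved facts, remaining history and user input (plus the expected reply
    // length) all fit inside the model's context window.
    public static string Build(LLamaContext context, int contextSize, string system,
                               string facts, IReadOnlyList<string> history,
                               string userInput, int maxNewTokens)
    {
        int CountTokens(string text) => context.Tokenize(text).Length;

        var kept = new List<string>(history);
        string Compose() => string.Join("\n", new[] { system, facts }.Concat(kept).Append(userInput));

        // Sacrifice the oldest history first until the whole prompt plus the reply budget fits.
        while (kept.Count > 0 && CountTokens(Compose()) + maxNewTokens > contextSize)
            kept.RemoveAt(0);

        return Compose();
    }
}
```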

With LlamaSharp I am struggling a bit on how best to do this. I have tried using a StatelessExecutor and sending in my fully formed prompt based on what I stated above. Unfortunately, I tend to get a lot of nonsense in the reply. The InteractiveExecutor generally gives some good results, but tends to forget about history and sometimes comes up with some really weird multi-turn conversations, making up names of other users.

When using the InteractiveExecutor, if I do not reset it to a clean state before each call, it also tends to give no response from the Infer method after the first one.

I am sure I am just going about this wrong; however, I am trying to follow the examples in this repo. I noticed that the examples in the repo do not keep sending a system message in the prompt. That is fine, but at some point enough tokens will flow through the prompts that the past tokens being included in the eval will no longer include the system prompt details.

I would love to see a good example where a conversation loop is occurring with a user and the bot, where the system message / instructions don't get pushed out of the model over time.

I have been using Semantic Kernel for the embedding. Maybe I should just try to update my code to use SK integration with LlamaSharp and see how that goes.

martindevans commented 11 months ago

> That is fine, but at some point enough tokens will flow through the prompts that the past tokens being included in the eval will no longer include the system prompt details.

This should be handled by the executor automatically. When the context fills up, it will keep some tokens from the start of the prompt (the TokensKeep property on the inference params) and then keep the last 50% of the tokens.

So for example if your context length is 10 and your initial system prompt is 4 long (and you set TokensKeep to 4) a history of: AAAABCDEFG would become AAAAFG____.

This isn't the ideal way to handle it by any means! But at least the system prompt should not be lost.
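
If it helps to picture it, here's a stand-alone sketch of the general idea (this is not the real implementation; the actual logic lives in HandleRunOutOfContext and the exact amount kept at the tail may differ):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ContextShrinkSketch
{
    // Conceptual only: keep the first `tokensKeep` tokens (e.g. the system prompt),
    // drop the oldest part of what follows, and keep the most recent half of it.
    public static List<T> Shrink<T>(IReadOnlyList<T> history, int tokensKeep)
    {
        var tail = history.Skip(tokensKeep).ToList();
        var recent = tail.Skip(tail.Count / 2);   // the newest ~50% of the non-kept tokens
        return history.Take(tokensKeep).Concat(recent).ToList();
    }
}
```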

> I would love to see a good example where a conversation loop is occurring with a user and the bot, where the system message / instructions don't get pushed out of the model over time.

Given the above, any of the first 6 example programs should do this.

> I have tried using a StatelessExecutor and sending in my fully formed prompt based on what I stated above. Unfortunately, I tend to get a lot of nonsense in the reply.

This seems like it should work, although it would be very slow since you're re-evaluating the entire history every time instead of keeping that cached in the context. If you try out the same prompt in llama.cpp (with main.exe) and don't get nonsense then that's a bug we need to fix (if you do get nonsense then it's probably some kind of prompting error).

> When using the InteractiveExecutor, if I do not reset it to a clean state before each call, it also tends to give no response from the Infer method after the first one.

Do you see this behaviour in the demos? This sounds like it may be a bug (although it is possible for the model to decide not to reply, I've found it to be rare).

mphacker commented 11 months ago

Thank you so much for this info! I am sure this will help others too. What is the best way to determine the number of tokens I have in my system prompt so I can properly set the TokensKeep to the right value? I have been doing a bunch of trial and error and have had some great success and some disasters. LOL. The work you are all putting in here is amazing. Thank you again!

I am going to take a step back and rework some code based on what you have shared. I will let you know if I am still seeing issues with the model not giving a response.

martindevans commented 11 months ago

I think the only way to know the number of tokens is to tokenize it yourself before you submit it to the model.

Something like this:

```csharp
// `executor` is the InteractiveExecutor (or other executor) you already created.
var tokens = executor.Context.Tokenize(system_prompt);
var tokens_keep = tokens.Length;   // pass this as TokensKeep in your InferenceParams
```

mphacker commented 11 months ago

Here is my thought process. Take my system prompt and get the number of tokens. The first time I send a message to the model I will include the system prompt at the top of the prompt being sent to InferAsync. When calling InferAsync I set the TokensKeep to the system prompt token size. For all future calls to InferAsync I do not need to include the system prompt, but I do need to keep telling InferAsync the number of TokensKeep.
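
In code I'm imagining something roughly like this (only a sketch; `executor` is the InteractiveExecutor I already have, and the prompt format and parameter values are placeholders):

```csharp
// Sketch of the flow described above; assumes `executor` is an existing InteractiveExecutor.
var systemPrompt = "You are a helpful, friendly chat bot. ...";
var tokensKeep = executor.Context.Tokenize(systemPrompt).Length;
var inferenceParams = new InferenceParams
{
    TokensKeep = tokensKeep,      // keep the system prompt from being dropped out of the context
    MaxTokens = 256,
    AntiPrompts = new[] { "User:" }
};

var firstTurn = true;
while (true)
{
    var userInput = Console.ReadLine();

    // Only the first call includes the system prompt; afterwards it lives in the context.
    var prompt = firstTurn
        ? $"{systemPrompt}\nUser: {userInput}\nBot:"
        : $"\nUser: {userInput}\nBot:";
    firstTurn = false;

    await foreach (var piece in executor.InferAsync(prompt, inferenceParams))
        Console.Write(piece);
}
```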

Sound right?

BTW, I was using the nous-hermes-llama-2-7b.Q5_K_M.gguf weights and with them I do end up getting blank responses after a few messages unless I reset the executor state. The llama-2-7b-guanaco-qlora.Q5_0.gguf weights don't seem to have that issue.

martindevans commented 11 months ago

Yep that sounds right 👍

martindevans commented 11 months ago

I'll close this one for now since it looks like it's been resolved, but feel free to re-open it.