SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

SemanticKernel ChatCompletion is Stateless #614

Open · kidkych opened this issue 3 months ago

kidkych commented 3 months ago

So I'm interested in using the LLamaSharp ChatCompletion interface for Semantic Kernel in an application I'm working on. I've been looking over the implementation and have some concerns regarding performance and chat state. I already had this conversation with @martindevans a month or so ago on Discord; you can view the convo here.

I haven't had any time to work on it until now. I'm going to have a look at fixing it, but the primary issue is the use of the StatelessExecutor for ChatCompletion. This leads to extremely long inference times as the chat history builds up, since the model has to re-process the entire context on every inference call. I remember that the implementation used to just wrap the existing LLamaSharp ChatSession class, but that made it difficult to manage conversation state, e.g. deleting the last message and regenerating a different response.
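To illustrate the cost (a minimal sketch; `FlattenHistory` is a hypothetical stand-in for however the integration renders the history to text):

```csharp
using LLama;
using LLama.Common;

// The stateless pattern: every call replays the FULL conversation.
var executor = new StatelessExecutor(weights, parameters);

// Hypothetical helper that renders the entire chat history as one prompt.
string prompt = FlattenHistory(chatHistory);

await foreach (var token in executor.InferAsync(prompt, inferenceParams))
{
    // No kv-cache survives between calls, so every prior turn is
    // re-tokenized and re-evaluated here, on every single request.
    Console.Write(token);
}
```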

I'm considering writing a dedicated ChatCompletion for semantic-kernel that doesn't use LLamaSharp's internal chat history but instead interfaces directly with the ChatHistory defined by Semantic Kernel. Any thoughts, @AsakusaRinne and @xbotter? Tagging you two since Rinne made the initial implementation and xbotter rewrote it into the current one.
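Roughly the shape I have in mind (a sketch only; the class and its delta-tracking are hypothetical, not an existing API):

```csharp
using LLama;
using Microsoft.SemanticKernel.ChatCompletion;

// Hypothetical: Semantic Kernel's ChatHistory stays the single source of
// truth, while a stateful executor keeps the kv-cache warm internally.
public sealed class StatefulChatCompletion
{
    private readonly InteractiveExecutor _executor;
    private int _consumed; // messages already evaluated into the executor's context

    public StatefulChatCompletion(InteractiveExecutor executor) => _executor = executor;

    public async IAsyncEnumerable<string> StreamReplyAsync(ChatHistory history)
    {
        // Feed only the messages the executor hasn't seen yet.
        var delta = history.Skip(_consumed).ToList();
        _consumed = history.Count;

        var prompt = string.Join("\n", delta.Select(m => $"{m.Role}: {m.Content}"));
        await foreach (var token in _executor.InferAsync(prompt))
            yield return token;
    }
}
```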

xbotter commented 3 months ago

The interfaces of Semantic Kernel are generally modeled on OpenAI's API, so the ChatCompletion interfaces are stateless. This is indeed a challenge for locally deployed models, and it seems unavoidable when Semantic Kernel is only used to execute plugins.

Interfacing directly with the ChatHistory does sound like a good idea. Can you share more details?

PrestigeDevop commented 3 months ago

Could you upload your project when it's done? I'm looking for an SK kernel that works with local models, so it would be nice to have some boilerplate code. I'm also considering the ONNX runtime; please share your thoughts if you have time.

AsakusaRinne commented 3 months ago

@kidkych Hi, sorry for the late reply. I think it's a good idea! More generally, this is not only a semantic-kernel integration issue but also something about LLamaSharp itself. What is needed is a way to flexibly manage the history along with its corresponding kv-cache. In fact, InteractiveExecutor is a special case which manages the kv-cache by always keeping all of it. Using the stateless executor would not be a bad option if a kv-cache could be attached to its history. However, it's still unclear how to design this. For example, should the kv-cache belong to the executor or to the history? Or perhaps both?

I would really appreciate your help with this. Completing the implementation is not the only way to contribute; sharing your ideas here would also help a lot, and I'm always eager to discuss them with you.
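To make the design question concrete, here is one possible shape (all names invented for discussion, not a proposal of an actual API):

```csharp
// Option: attach the kv-cache to the history rather than the executor.
// The history carries a handle to its evaluated state, so it can outlive
// any single executor and be partially invalidated when messages change.
public interface ICachedHistory
{
    IReadOnlyList<string> Messages { get; }

    // How many tokens of this history are already present in the kv-cache.
    int EvaluatedTokens { get; }

    // Drop cached state beyond a token position (e.g. after deleting the
    // last message) so only the divergent suffix needs re-evaluation.
    void TruncateCache(int tokenPosition);
}
```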

Here are the native APIs for kv-cache management: link

martindevans commented 3 months ago

> kv cache management

The BatchedExecutor exposes all of the kv-cache functionality per Conversation, so you can e.g. shift off tokens, rewind state, etc. That should be a good base to build future higher-level executors on.
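A minimal sketch of that per-Conversation control, based on the BatchedExecutor examples in the repo (details such as the Prompt and Rewind signatures may have shifted between versions):

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);

// Each Conversation owns an independent slice of the kv-cache.
using var conversation = executor.Create();
conversation.Prompt(executor.Context.Tokenize("Question: ..."));
await executor.Infer();

// Rewind this conversation's kv-cache by N tokens, e.g. to regenerate
// the last answer without re-evaluating the whole history.
conversation.Rewind(10);
```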

kidkych commented 2 months ago

Hey guys, sorry about the delay; I've been busy with work.

So I think the long-term solution is definitely to use the BatchedExecutor. Beyond supporting the kv-cache management @martindevans describes, it also makes sense for managing multiple conversations when the library backs a service with multiple simultaneous users. I'm going to read through the relevant code and see what I can come up with.
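For the multi-user case, something like this (same caveats as the earlier sketch; `executor` is a BatchedExecutor as above):

```csharp
// Two users share one BatchedExecutor: their prompts are evaluated
// together in a single batch, but each keeps an independent kv-cache.
using var userA = executor.Create();
using var userB = executor.Create();

userA.Prompt(executor.Context.Tokenize("User A's question"));
userB.Prompt(executor.Context.Tokenize("User B's question"));

// One Infer() call advances every conversation with pending work.
await executor.Infer();
```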

As a temporary stop-gap, however, I've opened PR #671 after looking at the InteractiveExecutor mentioned by @AsakusaRinne. There should not be any breaking changes, since the StatelessExecutor code path is untouched. When using a StatefulExecutorBase, responses should be faster because the entire history is no longer passed through the model on each call, just the most recent message; the earlier history is preserved in the executor's context. I tested it with specific questions about earlier responses and it maintains context. The downside is that the conversation history cannot be changed.
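The gist of the stop-gap, paraphrased (not the actual PR code; `FlattenHistory` is again a hypothetical stand-in):

```csharp
// With a stateful executor the prior turns already live in the model's
// context, so only the newest message needs to be evaluated. The
// StatelessExecutor path keeps the old full-replay behaviour.
string text = executor is StatefulExecutorBase
    ? chatHistory.Last().Content      // delta only
    : FlattenHistory(chatHistory);    // full history, every call

await foreach (var token in executor.InferAsync(text, inferenceParams))
    yield return token;
```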

@xbotter, can you elaborate on what you mean by Semantic Kernel's interfaces being stateless? Are you referring to the fact that they expect the entire ChatHistory with every call to GetChatMessageContentsAsync or GetStreamingChatMessageContentsAsync? If so, we can keep those interfaces and manage the state internally, i.e. using Semantic Kernel's ChatHistory together with the BatchedExecutor.
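Concretely, the reconciliation inside those calls could look something like this (all helpers here are hypothetical; the rewind would sit on top of the BatchedExecutor's per-Conversation kv-cache ops):

```csharp
// Compare the incoming ChatHistory against what has already been fed
// into the kv-cache, and do only the minimum work required.
if (IsPrefixOf(_fedMessages, history))
{
    // Common case: the caller just appended; evaluate only the new tail.
    FeedDelta(history.Skip(_fedMessages.Count));
}
else
{
    // History was edited (e.g. last answer deleted and regenerated):
    // rewind the Conversation's kv-cache to the divergence point
    // instead of re-evaluating everything from scratch.
    int keep = CommonPrefixLength(history, _fedMessages);
    RewindTo(keep);
    FeedDelta(history.Skip(keep));
}
```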