Maximilian-Winter / llama-cpp-agent

The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). It allows users to chat with LLM models, execute structured function calls and get structured output. It also works with models that are not fine-tuned for JSON output and function calls.

Stop LLM output on user request? #47

Open woheller69 opened 2 months ago

woheller69 commented 2 months ago

Is there a way to stop inference manually, e.g. by returning False from the streaming_callback? If the user presses a stop button in a UI, how could that be handled?

Maximilian-Winter commented 2 months ago

I'm not sure how to do it properly in llama_cpp_python, but it should be possible. Will add this ASAP

pabl-o-ce commented 2 months ago

It is possible to use the break keyword, or if you are handling a request you can also use a signal to control and finish the request.

woheller69 commented 2 months ago

It is not about a keyword. If a long text is being generated and it goes in the wrong direction, I want to stop it without losing the context by killing the process. The Python bindings of gpt4all, for example, have a callback similar to streaming_callback: if True is returned it continues, if False is returned it stops. In this callback I can check whether a button has been pressed and then return True/False.
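
For illustration, a minimal sketch of that callback pattern (the wiring is hypothetical, modeled on the gpt4all-style bindings; llama-cpp-agent's streaming_callback does not evaluate a return value at the time of this thread):

import threading

stop_requested = threading.Event()  # set from a UI stop button

def on_stop_button():
    stop_requested.set()

def streaming_callback(token):
    print(token, end="", flush=True)
    # Desired behaviour: True = keep generating, False = stop generation
    return not stop_requested.is_set()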

woheller69 commented 2 months ago

I need this for a local model, just in case this makes a difference

woheller69 commented 2 months ago

It seems there is a PR for llama-cpp-python regarding this: https://github.com/abetlen/llama-cpp-python/pull/733/files

Add cancel() method to interrupt a stream

But they do not want to merge it

There is also an issue: https://github.com/abetlen/llama-cpp-python/issues/599

pabl-o-ce commented 2 months ago

Call me a madman, but I just use something like this example to end the inference:

for chunk in llm.stream_chat(chat_template):
    if cancel_flag is True:
        break
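
As a follow-up sketch of that pattern with llama-cpp-python's streaming completion API (the model path is a placeholder and the Event wiring is illustrative, not part of llama-cpp-agent):

import threading
from llama_cpp import Llama

cancel_flag = threading.Event()  # another thread, e.g. a stop button handler, calls cancel_flag.set()

llm = Llama(model_path="model.gguf", n_ctx=1024)  # placeholder path

def generate(prompt):
    text = ""
    for chunk in llm(prompt, max_tokens=512, stream=True):
        if cancel_flag.is_set():
            break  # stop consuming the stream
        piece = chunk["choices"][0]["text"]
        text += piece
        print(piece, end="", flush=True)
    return text
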
woheller69 commented 2 months ago

Doesn't that just break the for loop, while the llm continues to stream?

Currently I have:

    llama_cpp_agent.get_chat_response(
        user_input, 
        temperature=0.7, 
        top_k=40, 
        top_p=0.4,
        repeat_penalty=1.18, 
        repeat_last_n=64, 
        max_tokens=2000,
        stream=True,
        print_output=False,
        streaming_callback=streaming_callback
    )

And in the streaming_callback I am printing the tokens as they come. Ideally this callback could return True/False to continue or stop generation.

pabl-o-ce commented 2 months ago

Let me create some tests for this.

woheller69 commented 2 months ago

In case there is no "clean" solution via llama_cpp_python, I found a solution using a thread_with_exception as in my code https://github.com/woheller69/LLAMA_TK_CHAT/

It starts inference in a separate thread and stops it by raising an exception. But that way the partial answer is not added to the chat history (I am doing this later using add_message(...) in my code), because I am calling llama_cpp_agent.get_chat_response(...) inside this thread. It would certainly be better if this were handled INSIDE llama_agent.py, maybe in get_chat_response(...) or get_response_role_and_completion(...), so that the partial answer can still be added to the history.

If my code doesn't look great, this is because I have no clue about Python :-)
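
A rough sketch of the llama-cpp-agent side of that workaround, reusing llama_cpp_agent and user_input from the snippet earlier in this thread and collecting the partial output in the streaming_callback (the exact payload passed to streaming_callback and the add_message(...) arguments are not shown here, so both are treated schematically):

partial_answer = []

def streaming_callback(chunk):
    # The payload type is an assumption; fall back to str() for printing
    text = getattr(chunk, "text", str(chunk))
    partial_answer.append(text)
    print(text, end="", flush=True)

def inference_callback():
    # Runs inside the interruptible thread (thread_with_exception)
    llama_cpp_agent.get_chat_response(
        user_input,
        stream=True,
        print_output=False,
        streaming_callback=streaming_callback,
    )

# After interrupting the thread, "".join(partial_answer) can be added back
# to the chat history via add_message(...), as described above.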

jewser commented 2 months ago

For those interested, here is a minimal adaptation of @woheller69's workaround:

from llama_cpp import Llama
import threading
import ctypes
import sys
import time

# https://github.com/woheller69/LLAMA_TK_CHAT/blob/main/LLAMA_TK_GUI.py
class thread_with_exception(threading.Thread):
    def __init__(self, name, callback):
        threading.Thread.__init__(self)
        self.name = name
        self.callback = callback

    def run(self):
        self.callback()

    def get_id(self):
        # returns id of the respective thread
        if hasattr(self, '_thread_id'):
            return self._thread_id
        for id, thread in threading._active.items():
            if thread is self:
                return id

    def raise_exception(self):
        # Inject a SystemExit exception into the inference thread via CPython's C API
        thread_id = self.get_id()
        if thread_id is not None:
            res = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), ctypes.py_object(SystemExit))
            if res > 1:
                # More than one thread state was modified: undo the pending exception
                ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), 0)

llm = Llama(
    model_path="../../llama.cpp/models/Meta-Llama-3-8B/ggml-model-f16.gguf",
    n_gpu_layers=-1,
    lora_path="../../llama.cpp/models/test/my_lora_1350.bin",
    n_ctx=1024,
)

def generate(prompt):
    for chunk in llm(
        ''.join(prompt),
        max_tokens=100,
        stop=["."],
        echo=False,
        stream=True,
    ):
        yield chunk["choices"][0]["text"]

def inference_callback():
    prompt = "juicing is the act of "

    print(prompt,end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        print(chunk,end='')
        sys.stdout.flush()
    print()

inference_thread = thread_with_exception("InferenceThread", inference_callback)
inference_thread.start()

try:
    for i in range(20):
        time.sleep(0.5)
    print("done normally")
except KeyboardInterrupt:
    inference_thread.raise_exception()
    inference_thread.join()
    print("interrupted")

Here we have an inference thread that can be interrupted by the main thread, which is busy doing something else (presumably running a webserver or a GUI window), though in this case it is just sleeping for 10 seconds.
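
For completeness, a hypothetical sketch of wiring the same interruption to a GUI stop button instead of Ctrl-C, reusing thread_with_exception and inference_callback from the snippet above (the Tkinter layout is illustrative, roughly in the spirit of LLAMA_TK_CHAT):

import tkinter as tk

root = tk.Tk()
# A Thread object can only be started once; a real UI would create a new
# worker per generation request.
worker = thread_with_exception("InferenceThread", inference_callback)

tk.Button(root, text="Generate", command=worker.start).pack()
tk.Button(root, text="Stop", command=worker.raise_exception).pack()
root.mainloop()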

42PAL commented 2 months ago

Using LM Studio to run the models works for me. I often stop the generation, edit the AI's mistakes and steer it in the direction I want, save the changes and then have it continue generating. This works on all models I have tried in the LM Studio app.