I'm not sure how to do it properly in llama_cpp_python, but it should be possible. Will add this ASAP.
It is possible to use the break keyword, or, if you are making a request, you can also use a signal to control/finish the request.
It is not about a keyword. If a long text is generated and it goes in the wrong direction, I want to stop it without losing the context by killing the process. The Python bindings of gpt4all, for example, have a callback similar to streaming_callback: if True is returned, it continues; if False is returned, it stops. In this callback I can check whether a button has been pressed and then return True/False.
I need this for a local model, just in case this makes a difference
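To make the pattern concrete, here is a minimal, runnable sketch of the callback style I mean. None of the names below are real gpt4all or llama-cpp-python API; they only illustrate a generation loop that stops cleanly when the callback returns False.

```python
import threading

stop_event = threading.Event()          # a UI "stop" button would set this


def token_callback(token: str) -> bool:
    print(token, end="", flush=True)    # stream tokens to the UI/console
    return not stop_event.is_set()      # False -> abort generation


def generate_with_callback(token_source, callback) -> str:
    produced = []
    for token in token_source:          # tokens arrive lazily from the model
        produced.append(token)
        if not callback(token):
            break                       # stop, but keep the partial answer
    return "".join(produced)


if __name__ == "__main__":
    # A fake token source stands in for the model so the sketch is runnable.
    fake_tokens = iter(["Hello", ",", " ", "world", "!"])
    text = generate_with_callback(fake_tokens, token_callback)
    print("\npartial/full answer kept:", text)
```

A UI would implement the callback so that it prints each token and sets the stop event once the user presses a stop button, while the partial answer is still retained.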
It seems there is a PR for llama-cpp-python regarding this: https://github.com/abetlen/llama-cpp-python/pull/733/files
Add cancel() method to interrupt a stream
But they do not want to merge it
There is also an issue: https://github.com/abetlen/llama-cpp-python/issues/599
Call me a madman, but I just use something like this to end the inference:
```python
for chunk in llm.stream_chat(chat_template):
    if cancel_flag is True:
        break
```
Doesn't that just break the for loop but the llm continues to stream?
Currently I have:
```python
llama_cpp_agent.get_chat_response(
    user_input,
    temperature=0.7,
    top_k=40,
    top_p=0.4,
    repeat_penalty=1.18,
    repeat_last_n=64,
    max_tokens=2000,
    stream=True,
    print_output=False,
    streaming_callback=streaming_callback,
)
```
And in the streaming_callback I am printing the tokens as they come. Ideally this callback could return True/False to continue/stop
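To make the request concrete, here is a sketch of the kind of callback I mean. This only shows the desired behaviour: I am assuming the callback receives the token text (the actual argument type in llama_cpp_agent may differ), and as far as I can tell the library currently ignores the return value, which is exactly what I would like to change.

```python
import threading

cancel_requested = threading.Event()    # set from the UI's stop button handler


def streaming_callback(token_text):
    # print tokens as they arrive (what the callback does today)
    print(token_text, end="", flush=True)
    # desired behaviour: returning False would make get_chat_response()
    # stop generating while still adding the partial answer to the history
    return not cancel_requested.is_set()
```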
Let me create some tests for this.
In case there is no "clean" solution via llama_cpp_python, I found a workaround using a thread_with_exception, as in my code: https://github.com/woheller69/LLAMA_TK_CHAT/
It starts inference in a separate thread and stops it by raising an exception. But that way the partial answer is not added to the chat history (I am doing this later using add_message(...) in my code), because I am running llama_cpp_agent.get_chat_response(...) in that thread. It would certainly be better if this were handled INSIDE llama_agent.py, maybe in get_chat_response(...) or get_response_role_and_completion(...), so that the partial answer can still be added to the history.
If my code doesn't look great, this is because I have no clue about Python :-)
For those interested, here is a minimal adaptation of @woheller69's workaround:
```python
import ctypes
import sys
import threading
import time

from llama_cpp import Llama


# Adapted from https://github.com/woheller69/LLAMA_TK_CHAT/blob/main/LLAMA_TK_GUI.py
class thread_with_exception(threading.Thread):
    def __init__(self, name, callback):
        threading.Thread.__init__(self)
        self.name = name
        self.callback = callback

    def run(self):
        self.callback()

    def get_id(self):
        # returns the id of the respective thread
        if hasattr(self, '_thread_id'):
            return self._thread_id
        for id, thread in threading._active.items():
            if thread is self:
                return id

    def raise_exception(self):
        # asynchronously raise SystemExit inside the thread to stop it
        thread_id = self.get_id()
        if thread_id is not None:
            res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
                ctypes.c_long(thread_id), ctypes.py_object(SystemExit)
            )
            if res > 1:
                # undo if more than one thread was affected
                ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), 0)


llm = Llama(
    model_path="../../llama.cpp/models/Meta-Llama-3-8B/ggml-model-f16.gguf",
    n_gpu_layers=-1,
    lora_path="../../llama.cpp/models/test/my_lora_1350.bin",
    n_ctx=1024,
)


def generate(prompt):
    # stream completion chunks and yield only the generated text
    for chunk in llm(
        ''.join(prompt),
        max_tokens=100,
        stop=["."],
        echo=False,
        stream=True,
    ):
        yield chunk["choices"][0]["text"]


def inference_callback():
    prompt = "juicing is the act of "
    print(prompt, end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        print(chunk, end='')
        sys.stdout.flush()
    print()


inference_thread = thread_with_exception("InferenceThread", inference_callback)
inference_thread.start()

try:
    # main thread is busy elsewhere; here it just sleeps for 10 seconds
    for i in range(20):
        time.sleep(0.5)
    print("done normally")
except KeyboardInterrupt:
    inference_thread.raise_exception()
    inference_thread.join()
    print("interrupted")
```
Here we have an inference thread that may be interrupted by the main thread which is busy doing something else (presumably listening as a webserver or a gui window or something), though in this case it is just sleeping for 10 seconds.
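One way to soften the problem mentioned earlier, that the partial answer is lost when the thread is killed: let the inference callback accumulate the streamed chunks in a shared list, so the main thread can still retrieve whatever was produced before the exception was raised and append it to the chat history itself (e.g. via the agent's add_message(...) mentioned above; I have not checked its exact signature). A sketch, building on the example above:

```python
partial_output = []                      # shared between the two threads


def inference_callback():
    prompt = "juicing is the act of "
    print(prompt, end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        partial_output.append(chunk)     # keep everything produced so far
        print(chunk, end='')
        sys.stdout.flush()
    print()


# after raise_exception()/join() in the main thread:
# partial_text = "".join(partial_output)
# ...then add partial_text to the chat history manually, e.g. with
# add_message(...) (signature assumed, not checked).
```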
Using LM Studio to run the models works for me. I often stop the generation, edit the AI's mistakes to steer it in the direction I want, save the changes, and then have it continue generating. This has worked on all models I have tried in the LM Studio app.
Is there a way to stop inference manually, e.g. by returning False from the streaming_callback? If the user presses a stop button in a UI, how could that be handled?