This is definitely on the roadmap - it's a little tricky due to how we use structured outputs, but it's possible.
If you take a look at LM Studio (for example), there is a little snippet in there that does real-time text streaming:
```python
# Chat with an intelligent assistant in your terminal
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "You are an intelligent assistant. You always provide well-reasoned answers that are both correct and helpful."},
    {"role": "user", "content": "Hello, introduce yourself to someone opening this program for the first time. Be concise."},
]

while True:
    completion = client.chat.completions.create(
        model="local-model",  # this field is currently unused
        messages=history,
        temperature=0.7,
        stream=True,
    )

    new_message = {"role": "assistant", "content": ""}
    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)

    # Uncomment to see chat history
    # import json
    # gray_color = "\033[90m"
    # reset_color = "\033[0m"
    # print(f"{gray_color}\n{'-'*20} History dump {'-'*20}\n")
    # print(json.dumps(history, indent=2))
    # print(f"\n{'-'*55}\n{reset_color}")

    print()
    history.append({"role": "user", "content": input("> ")})
```
In particular, note this part:
```python
new_message = {"role": "assistant", "content": ""}
for chunk in completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
        new_message["content"] += chunk.choices[0].delta.content

history.append(new_message)
```
Maybe this is a good start for getting this implemented in MemGPT.
I was already looking into it myself, but I can't seem to figure it out on my own, I'm afraid...
I have also played about with streaming text. Each LLM server has a slightly different approach to this, but the for loop is key to all of them. I think the best way to figure it out for each server is to play about with a standalone script first: follow the relevant server's docs, and once it's confirmed working, test it out with MemGPT.
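For example, a minimal standalone streaming test against any OpenAI-compatible `/v1` endpoint might look like this (the `base_url` and model name below are placeholders; adjust them for your server):

```python
# Minimal standalone streaming test against an OpenAI-compatible server
# (base_url and model are placeholders -- point them at LM Studio, vLLM, etc.).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```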
Have a similar issue here with vLLM. For now, my workaround might just be to wait for the full generation from MemGPT, then add an artificial delay that iterates over the assistant_message output and streams it back to my client.
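A rough sketch of that workaround (the names here are illustrative, not actual MemGPT APIs): wait for the completed message, then replay it in small chunks with a delay.

```python
import time

def fake_stream(assistant_message: str, chunk_size: int = 8, delay: float = 0.02):
    """Yield an already-completed message a few characters at a time."""
    for i in range(0, len(assistant_message), chunk_size):
        yield assistant_message[i:i + chunk_size]
        time.sleep(delay)

# Example: replay a finished response to stdout as if it were generated live.
for piece in fake_stream("This reply was fully generated before being streamed."):
    print(piece, end="", flush=True)
print()
```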
This could be a roadmap item: text output would stream as the LLM generates the message or thought. A use case I can think of for this would be implementing TTS with a shorter response time (TTS would speak every sentence as it is generated).
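As a purely illustrative sketch of that TTS idea (the `speak` callback is hypothetical, standing in for whatever TTS engine you use), you could buffer the streamed tokens and hand off each complete sentence as soon as it appears:

```python
import re

def stream_to_tts(token_stream, speak):
    """Buffer streamed text chunks and pass complete sentences to speak()."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:  # all but the last part are complete sentences
            speak(sentence)
        buffer = parts[-1]           # keep the (possibly incomplete) remainder
    if buffer.strip():
        speak(buffer)                # flush whatever is left when the stream ends

# Example usage with a fake token stream and print() standing in for a TTS engine:
fake_tokens = iter(["Hello the", "re! How a", "re you? I am fine."])
stream_to_tts(fake_tokens, speak=print)
```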
Though this would require refactoring a lot of MemGPT's code, since the LLM generally has to output JSON. I think this could be solved by having each function handled by its own agent: one handles the thought, one handles the message (both could use streaming output), and another handles function calling (the one that doesn't necessarily need text streaming as an output).
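A purely illustrative sketch of that split (none of these names are MemGPT APIs, and the prompts are placeholders): the thought agent and the message agent stream plain text, while the function-calling agent stays non-streaming and returns JSON.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def stream_agent(system_prompt: str, user_input: str):
    """Stream plain-text output from one single-purpose agent."""
    stream = client.chat.completions.create(
        model="local-model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

user_input = "What's the weather like today?"

# The thought agent and the message agent both stream as they generate.
for piece in stream_agent("Think step by step about the user's request.", user_input):
    print(piece, end="", flush=True)
print()
for piece in stream_agent("Reply to the user conversationally.", user_input):
    print(piece, end="", flush=True)
print()

# The function-calling agent can stay non-streaming: a regular (stream=False)
# completion whose JSON output is parsed with json.loads().
```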
This could also make it easier for developers to build a GUI that shows users the LLM's output live.