cpacker / MemGPT

Letta (fka MemGPT) is a framework for creating stateful LLM services.
https://letta.com
Apache License 2.0

Streaming support? #345

Closed ProjCRys closed 6 months ago

ProjCRys commented 11 months ago

This could be a roadmap item: text output would stream as the LLM generates the message or thought. One use case I can think of is TTS with a shorter response time (the TTS would speak each sentence as it is generated).

This would require refactoring a lot of MemGPT's code, though, since the LLM generally has to output JSON. I think that could be solved by having each function handled by its own agent: one handles the thought, one handles the message (both could use streaming output), and another handles function calling (which doesn't necessarily need streaming output).

This would also make it easier for developers to build GUIs that show users the LLM's output live.
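For the TTS use case above, here is a rough sketch of sentence-level buffering on top of a streamed response. The speak() function and the local endpoint/model name are placeholders, not part of MemGPT:

import re
from openai import OpenAI

def speak(sentence: str):
    # Placeholder: hand the finished sentence to whatever TTS engine you use.
    print(f"[TTS] {sentence}")

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)

buffer = ""
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    buffer += delta
    # Flush the buffer every time a sentence terminator appears,
    # so TTS can start speaking before the full reply has finished.
    while True:
        match = re.search(r"[.!?](\s|$)", buffer)
        if not match:
            break
        sentence, buffer = buffer[:match.end()], buffer[match.end():]
        speak(sentence.strip())

if buffer.strip():
    speak(buffer.strip())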

cpacker commented 11 months ago

This is definitely on the roadmap - it's a little tricky due to how we use structured outputs, but it's possible.
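To illustrate why structured output makes naive streaming awkward: the reply is a JSON object (thoughts, message, function call), so streaming raw tokens would leak JSON syntax to the user. One very rough idea, purely as a sketch and not how MemGPT handles it, is to scan the partial JSON for a known key and emit only that string value's characters as they arrive (the "message" key name here is illustrative):

def stream_field(chunks, key="message"):
    # Yield characters of the JSON string value for `key` as raw chunks arrive.
    # Very simplified: assumes the value is a plain string with no escaped quotes.
    needle = f'"{key}"'
    buffer = ""
    inside_value = False
    for chunk in chunks:
        buffer += chunk
        if not inside_value:
            idx = buffer.find(needle)
            if idx == -1:
                continue
            start = buffer.find('"', idx + len(needle))  # opening quote of the value
            if start == -1:
                continue
            buffer = buffer[start + 1:]
            inside_value = True
        end = buffer.find('"')
        if end == -1:
            yield buffer
            buffer = ""
        else:
            yield buffer[:end]
            return

# Simulated chunks of a structured reply, split at arbitrary points:
parts = ['{"thoughts": "...", "mess', 'age": "Hello', ' there!"}']
print("".join(stream_field(parts)))  # prints: Hello there!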

renatokuipers commented 10 months ago

If you take a look at (for example) LM Studio, there is a little snippet in there that does real-time text streaming.

# Chat with an intelligent assistant in your terminal
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "You are an intelligent assistant. You always provide well-reasoned answers that are both correct and helpful."},
    {"role": "user", "content": "Hello, introduce yourself to someone opening this program for the first time. Be concise."},
]

while True:
    completion = client.chat.completions.create(
        model="local-model", # this field is currently unused
        messages=history,
        temperature=0.7,
        stream=True,
    )

    new_message = {"role": "assistant", "content": ""}

    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)

    # Uncomment to see chat history
    # import json
    # gray_color = "\033[90m"
    # reset_color = "\033[0m"
    # print(f"{gray_color}\n{'-'*20} History dump {'-'*20}\n")
    # print(json.dumps(history, indent=2))
    # print(f"\n{'-'*55}\n{reset_color}")

    print()
    history.append({"role": "user", "content": input("> ")})

In particular, this part:

    new_message = {"role": "assistant", "content": ""}

    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            new_message["content"] += chunk.choices[0].delta.content

    history.append(new_message)

Maybe this is a good starting point for implementing this in MemGPT.

I was already looking into it myself, but I can't seem to figure it out on my own, I'm afraid...

gavsgav commented 10 months ago

I have also played around with streaming text. Each LLM server has a slightly different approach to this, but the for loop is key in each case. I think the best way to figure it out for a given server is to experiment with a standalone script first: follow the relevant server's docs and then, once it's confirmed working, test it with MemGPT.
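For example, here is a minimal standalone test against an OpenAI-compatible server that reads the server-sent events directly, without the openai client. The URL and model name are placeholders, and the "data: " prefix / "[DONE]" sentinel follow the usual OpenAI-style convention, so check your server's docs:

import json
import requests

# Placeholder endpoint/model; adjust for your own server (LM Studio, vLLM, etc.).
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue
    line = line.decode("utf-8")
    # OpenAI-style servers prefix each event with "data: " and finish with "[DONE]".
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)
print()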

spjcontextual commented 6 months ago

I have a similar issue here with vLLM. For now my workaround might just be to wait for a full generation from MemGPT and then add a fake delay that iterates over the assistant_message output and streams it back to my client.
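As a rough sketch of that workaround (the chunk size and delay are arbitrary, and assistant_message stands in for the completed MemGPT output):

import time

def fake_stream(assistant_message: str, chunk_size: int = 8, delay: float = 0.02):
    # Yield an already-complete reply in small pieces with a short delay,
    # so the client sees it "stream" even though generation already finished.
    for i in range(0, len(assistant_message), chunk_size):
        yield assistant_message[i:i + chunk_size]
        time.sleep(delay)

# Example: pretend this came back from a full (non-streaming) MemGPT generation.
for piece in fake_stream("Hello! This reply was generated all at once."):
    print(piece, end="", flush=True)
print()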