Quansight / ragna

RAG orchestration framework ⛵️
https://ragna.chat
BSD 3-Clause "New" or "Revised" License

Streaming assistant responses #185

Closed pmeier closed 7 months ago

pmeier commented 10 months ago

Some APIs like OpenAI and Anthropic support streaming the generated response. This is a really nice UX, because the user gets instant feedback and doesn't have to wait for a long time to receive a large blob of text.

However, due to Ragna's backend structure, this is either "impossible" (read: infeasible for the upside) or really hard. The problem is that we receive the response on the worker, but consume it on the API side. In between the two sits our result storage. This means we would need to somehow implement a streaming approach for the results and then also implement streaming in our own API.

I really would like to have this feature and toyed with it in the past. Unfortunately, I gave up at some point for the reasons stated above. If someone wants to have a go at this, be my guest. But be warned that this likely will turn really ugly.

petrpan26 commented 10 months ago

It makes sense to me! Streaming is definitely hard. My initial thought is to use it as an additional check with a timeout, maybe at the request level, so that the client doesn't shut down the connection while it is still fetching the response. But I will revisit this once I have more familiarity with the code base. Is there any issue you think I could help with?

pmeier commented 10 months ago

If we just want to circumvent the timeout on the worker side, we could still implement streaming there. It wouldn't make a difference to the user, as we would only return once we have the full response, but since we should get an initial response super fast, we could maybe lower the timeout again.
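
To sketch what that could look like (using the OpenAI streaming API as an example; the helper below is illustrative, not part of Ragna): the worker consumes the provider's stream but still returns a single blob, so nothing changes for the user, and because the first chunk arrives almost immediately, a lower worker timeout would suffice.

import json

import httpx

async def answer_on_worker(prompt: str, api_key: str) -> str:
    chunks = []
    async with httpx.AsyncClient(timeout=60) as client:
        async with client.stream(
            "POST",
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "gpt-3.5-turbo",
                "messages": [{"role": "user", "content": prompt}],
                "stream": True,
            },
        ) as response:
            async for line in response.aiter_lines():
                # OpenAI streams server-sent events as "data: {...}" lines
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
                chunks.append(delta.get("content", ""))
    return "".join(chunks)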

nenb commented 10 months ago

Does communicating with the LLM (answer()) need to be done via the worker? I understand that the embeddings (prepare()) absolutely need this, as they are compute-intensive. But communicating with the LLM is more of an I/O bound problem, and I've had no problem running it on the main thread (asynchronously of course) in previous applications.

This would then make it a lot easier to handle the streaming use-case (which I agree is absolutely essential for RAG). There might need to be a few more changes to deal with the fact that the streamed response will return an Iterable, but I don't see this as being a major issue right now.
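
To illustrate the I/O-bound point, here is a sketch against the OpenAI API (not Ragna code): while a request is in flight the event loop stays free, so several answers can be awaited concurrently without a worker.

import asyncio

import httpx

async def answer(client: httpx.AsyncClient, prompt: str, api_key: str) -> str:
    # a single chat completion request; awaiting yields control back to the
    # event loop while we wait on the network
    response = await client.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

async def main(api_key: str) -> None:
    async with httpx.AsyncClient(timeout=60) as client:
        # two answers generated concurrently on the main thread, no worker
        print(await asyncio.gather(
            answer(client, "What is RAG?", api_key),
            answer(client, "What is Ragna?", api_key),
        ))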

petrpan26 commented 10 months ago

I assume that's because there can also be a self-hosted LLM running on the same machine? But I agree: since the final output of RAG is a synthesis step, it should be able to support streaming if the underlying LLM supports it.

pmeier commented 10 months ago

@petrpan26 is right about the reason. We use the same abstraction for local LLMs as for hitting an external API. Thus, everything goes through the task queue.

@nenb

is absolutely essential for RAG

I disagree here. Streaming is nice to have and not essential. Everything works perfectly fine without streaming.

nenb commented 10 months ago

@petrpan26 is right about the reason. We use the same abstraction for local LLMs as for hitting an external API. Thus, everything goes through the task queue.

Could one of you explain to me why we need a worker/task queue specifically for a local LLM? I don't follow the logic here.

If there's any interest, I've added a minimal PoC of streaming here. Rather than calling the answer() method on the consumer, I called it on the producer. I added the async API so that this wouldn't cause problems for the producer. So, this is what's confusing me about the mention of local LLMs - which part of this wouldn't work with local LLMs?
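
For concreteness, the idea boils down to something like the following (sketch only, assuming a FastAPI app; get_chat, the endpoint path, and the shape of chat.answer() are placeholders, not Ragna's actual API):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chats/{chat_id}/answer")
async def answer(chat_id: str, prompt: str):
    chat = get_chat(chat_id)  # placeholder lookup helper

    async def chunk_stream():
        # answer() runs in the API process itself and yields text chunks
        async for chunk in chat.answer(prompt):
            yield chunk

    return StreamingResponse(chunk_stream(), media_type="text/plain")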

Although I don't think any of this is particularly complex, it would likely imply some fairly major restructuring of the code, e.g. making the assistants truly async. So, I'm not too sure how realistic it is. Having said that:

I disagree here. Streaming is nice to have and not essential. Everything works perfectly fine without streaming.

Apologies, I might have come across a bit strong here in my excitement! What I really meant was that with slower connections, longer responses, or self-hosting compute constraints, it's not unrealistic to be waiting beyond 30 s for the full response to be generated. I personally find this to be a real productivity sink, as it's enough time for me to get distracted or go elsewhere. Streaming is great because it keeps me engaged the entire time, even on slow connections. So, everything still works, but I think it's a question of how much value is lost.

pmeier commented 9 months ago

Making some progress on this in #215. One thing that I'm not super sure about yet is the type of the individual "chunk" of the message that we return. We cannot just return a str here, because we somehow also need to include the sources.

For the Python API, we could just return multiple Message objects, meaning users could piece the full message together with something like:

from ragna.core import Message

chunks = []
async for message_chunk in chat.answer(prompt):
    chunks.append(message_chunk)
    # do something with the chunk here

message = Message(
    content="".join(chunk.content for chunk in chunks),
    role=chunks[0].role,
    sources=chunks[0].sources,
)

Not too happy with it; it feels kinda clunky. I thought about adding a classmethod like Message.from_chunks, but then we are introducing another "chunk" term into the Ragna lingo. We already use that term for

https://github.com/Quansight/ragna/blob/0e1c35a20259fc55615ef303e7c90a9700eb377b/ragna/source_storages/_vector_database.py#L44-L48
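
For completeness, what I have in mind for from_chunks is roughly this (sketch only, not a settled API):

@classmethod
def from_chunks(cls, chunks: list["Message"]) -> "Message":
    return cls(
        content="".join(chunk.content for chunk in chunks),
        role=chunks[0].role,
        sources=chunks[0].sources,
    )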

For the REST API, the situation is similar. My current thought is to return something like

from typing import Optional

from pydantic import BaseModel

# Message here is the existing message model of the REST API
class AnswerOutput(BaseModel):
    content: Optional[str]
    message: Optional[Message]

While streaming, we would have AnswerOutput(content=..., message=None). When we are done streaming, we create the full message including the sources and return it like AnswerOutput(content=None, message=...). I think this change in the object internals isn't too bad and somewhat normal in the frontend world.
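
On the server side, the flow would then be roughly the following (sketch only; the generator name and the way the full message gets assembled are placeholders):

async def stream_answer(chat, prompt):
    chunks = []
    async for chunk in chat.answer(prompt):
        chunks.append(chunk)
        # while streaming: just the text of this chunk, no sources yet
        yield AnswerOutput(content=chunk.content, message=None)
    # done streaming: the full message, now including the sources
    yield AnswerOutput(
        content=None,
        message=Message(
            content="".join(chunk.content for chunk in chunks),
            role=chunks[0].role,
            sources=chunks[0].sources,
        ),
    )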

Any input appreciated.

nenb commented 9 months ago

Two other OSS packages that I am reasonably familiar with (and that have reasonably wide impact, judging by stars and social media discussion) have adopted designs similar to what you have just described.

simpleaichat is similar to your first proposal - see the stream method here, and a notebook (cells 20-22) here.

llm seems more similar to your second proposal - see the __iter__, _force, __str__ and text methods here and some relevant documentation here.

I personally prefer the second approach. Asking a user to build Message objects themselves feels a bit clunky like you said. It feels kind of natural to me to have an AnswerOutput object that switches between two different states, depending on whether the stream has been exhausted or not. Is there a reason that a similar approach cannot also be used for the Python API?

About terminology: my two cents is that I agree with you about trying to avoid using chunks in multiple places, but that this is already widely done (see the two previous packages for examples) and so it's probably swimming against the tide.

(#215 looking good!)

pmeier commented 9 months ago

I personally prefer the second approach.

Indeed. I think a good blueprint that we can use here is how HTTP responses from requests or httpx work. My current idea for the Python API is something along the lines of
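
(Sketch only; the names below are placeholders rather than the final API, mirroring how an httpx.Response can be read at once or iterated over.)

class Message:
    def __init__(self, content, *, role, sources):
        # content is either the full string or an iterator of text chunks
        self._content = content
        self.role = role
        self.sources = sources

    def __iter__(self):
        if isinstance(self._content, str):
            yield self._content
            return
        chunks = []
        for chunk in self._content:
            chunks.append(chunk)
            yield chunk
        # cache the full content once the stream is exhausted
        self._content = "".join(chunks)

    def read(self) -> str:
        if not isinstance(self._content, str):
            self._content = "".join(self._content)
        return self._content

    def __str__(self) -> str:
        return self.read()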

With that, one can do

message = chat.answer(...)
print(message)

or

message = chat.answer(...)
for chunk in message:
    print(chunk)

In both cases they still have access to the other attributes.


As for the REST API: I kinda like a similar approach, which is what the LLM API providers do as well. You always return the full JSON object, but if one hits the endpoint with stream=True, the content field actually is just a chunk and not the full response. In that case the user has to build the full message themselves, but I feel this is OK given that they explicitly set stream=True.
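
Concretely, a client would do something like this (sketch only; the URL, the request payload, and the line-delimited JSON framing are assumptions, not the final REST API):

import json

import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:31476/chats/123/answer",  # illustrative URL only
    json={"prompt": "What is Ragna?", "stream": True},
) as response:
    chunks = []
    for line in response.iter_lines():
        if not line:
            continue
        data = json.loads(line)
        # while streaming, "content" holds just the next chunk of the answer
        chunks.append(data["content"])
    print("".join(chunks))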

pmeier commented 7 months ago

Closed in #215.