Chainlit / chainlit

Build Conversational AI in minutes ⚡️
https://docs.chainlit.io
Apache License 2.0

Update to real Async and Streaming #7

Closed · hooman-bayer · closed 1 year ago

hooman-bayer commented 1 year ago

This is amazing work! Props to you! A lot of the ideas are really forward-looking, such as asking the user for input via actions!

I was looking into the examples and it seems like the current implementation is not really using asynchronous endpoints. For instance:

  1. The OpenAI Python SDK offers openai.ChatCompletion.acreate, which with stream=True yields an async generator (see the sketch below)
  2. LangChain offers an AsyncCallbackHandler

This is especially helpful for agents that can take a long time to run and might clog the backend.
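
For illustration, a minimal sketch of consuming such an async stream with the legacy (pre-1.0) openai SDK; the model name and prompt are placeholders:

import asyncio

import openai


async def stream_chat(prompt: str) -> str:
    # With stream=True, acreate returns an async generator of chunks (SSE under the hood)
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    full = ""
    async for chunk in response:
        token = chunk["choices"][0]["delta"].get("content", "")
        full += token
        # forward `token` to the client here instead of waiting for the full answer
    return full


if __name__ == "__main__":
    print(asyncio.run(stream_chat("Hello!")))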

Cheers

willydouhard commented 1 year ago

You are correct, we are not leveraging async implementations at the moment. The main reason is that I feel like most code bases are not written in the async paradigm, and it is quite hard and not always possible to transition from sync to async.

To mitigate this, we currently run agents in different threads so that a single agent will not block the whole app.
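
Roughly the idea (a minimal sketch of the thread-offloading pattern, not Chainlit's actual internals):

import asyncio
import time


def run_agent_sync(user_input: str) -> str:
    # stands in for a blocking LangChain agent / chain call
    time.sleep(2)
    return f"answer to: {user_input}"


async def handle_message(user_input: str) -> str:
    # asyncio.to_thread (Python 3.9+) keeps the event loop free while the sync agent runs
    return await asyncio.to_thread(run_agent_sync, user_input)


async def main():
    # two "users" are served concurrently even though each agent call blocks its own thread
    print(await asyncio.gather(handle_message("a"), handle_message("b")))


if __name__ == "__main__":
    asyncio.run(main())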

As we move forward I would love to see Chainlit support async implementations :)

willydouhard commented 1 year ago

Regarding streaming, Chainlit already supports it with OpenAI, LangChain and any Python code. See https://docs.chainlit.io/concepts/streaming/python :)

hooman-bayer commented 1 year ago

@willydouhard happy to contribute at some point in the near future if it becomes part of your roadmap. I have been building an app that fully extends LangChain to async, including tools (using their class signature that offers arun). But you are 100% right that most libraries only offer sync APIs.

hooman-bayer commented 1 year ago

> Regarding streaming, Chainlit already supports it with OpenAI, LangChain and any Python code. See https://docs.chainlit.io/concepts/streaming/python :)

Correct, I saw it, but it's again kind of faking it :) the client still needs to wait until the response is fully completed by the OpenAI endpoint, which might not be desired. For instance, openai.ChatCompletion.acreate opens an SSE stream and passes the response to the client token by token as it is generated by ChatGPT, so the latency is much smaller.

Also, in the case of your WebSocket for action agents, this could bring a much better experience for the user.

willydouhard commented 1 year ago

Interesting; in my understanding, openai.ChatCompletion.create was not waiting for the whole response to be generated to start streaming tokens. Do you happen to have a link to a resource covering that in more detail?

segevtomer commented 1 year ago

To add to the conversation, I tried to use a few different chain classes and couldn't get streaming to work on any of them (the screen only updated once the response was complete).

willydouhard commented 1 year ago

For LangChain, only the intermediary steps are streamed at the moment. If you configured your LLM with streaming=True you should see the intermediary steps being streamed if you unfold them in the UI (click on the Using... button).

I will take a look at how to also stream the final response!
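
For reference, a minimal sketch of an LLM configured with streaming=True (legacy LangChain API; a stdout callback stands in here for whatever forwards tokens to the UI, and exact imports may differ between versions):

from langchain import LLMChain, PromptTemplate
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    streaming=True,  # emit tokens as they are generated
    callbacks=[StreamingStdOutCallbackHandler()],  # each token is printed as it arrives
    temperature=0,
)

prompt = PromptTemplate(template="Answer briefly: {question}", input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=llm)
chain.run(question="What is token streaming?")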

hooman-bayer commented 1 year ago

@willydouhard see this issue from the OpenAI Python SDK for more details. In general, if you want to keep the tool as a simple POC for only a few users, it is great as is with sync, but what if we want to scale to 100 users or so? I think running all of this on different threads is not very realistic or modern (the user experience with sync also won't be great); async is probably the way to go.

willydouhard commented 1 year ago

Thank you for the link @hooman-bayer, I pretty much agree with you, and we also want to see where the community wants Chainlit to go: staying a rapid prototyping tool, or deploying to production and scaling.

As for streaming final responses in LangChain, @segevtomer, I found this interesting issue: https://github.com/hwchase17/langchain/issues/2483. I'll dig more into it!

hooman-bayer commented 1 year ago

@willydouhard that is using the AsyncCallbackHandler I mentioned above. With it you get access to on_llm_new_token(self, token: str, **kwargs: Any) -> None, which you can then customize to return the output token by token to the client.
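
A minimal sketch of such a handler (legacy LangChain API; send_token is a hypothetical placeholder for whatever forwards tokens to the connected client, not a real Chainlit or LangChain function):

from typing import Any

from langchain.callbacks.base import AsyncCallbackHandler


async def send_token(token: str) -> None:
    # hypothetical placeholder: push the token to the connected client (e.g. over a websocket)
    print(token, end="", flush=True)


class TokenForwardingHandler(AsyncCallbackHandler):
    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # called once per generated token when the LLM runs with streaming=True
        await send_token(token)


# usage sketch: pass the handler to an async run of the chain or agent
# result = await chain.arun(question="...", callbacks=[TokenForwardingHandler()])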

willydouhard commented 1 year ago

@segevtomer For clarity, all the intermediary steps are already streamed, including the last one, which is the final response. Then the final response is sent as a standalone message (not an intermediary step) in the UI without any overhead so the user can see it.

What I am saying here is that the only improvement we can make is to stream the last tokens after your stop token (usually Final answer:) directly, without waiting for the completion to end. This is what https://github.com/hwchase17/langchain/issues/2483 does.

While this would be a win for the user experience, the actual time gain will be very limited, since it only impacts a few tokens at the very end of the whole process.

segevtomer commented 1 year ago

Thanks for the update @willydouhard. I wouldn't say it's "very limited". I agree there will still be a delay, because we have to wait for the final prompt in the chain to occur; however, it is still very valuable to stream it. Say the final response is over 1k tokens long: streaming that will still be significant for UX.

Banbury commented 1 year ago

I have two problems with this:

So what I would need is either streaming of the final result, or a configurable timeout before the UI loses connection to the server, and some spinner to indicate that something is happening. Preferably both.

willydouhard commented 1 year ago

@Banbury what is your setup? Are you running open source models like gpt4all locally, or are you using the OpenAI API?

Banbury commented 1 year ago

I have been trying to run Vicuna locally with langchain. It does work more or less, but only for short texts.

willydouhard commented 1 year ago

So I have seen issues for local models and we are investigating them. For API-based models everything should work fine. It would be helpful if you could share a code snippet so I can try to reproduce.

Banbury commented 1 year ago

This is the code I have been working on. It's simple enough.

import chainlit as cl
from llama_cpp import Llama
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain import PromptTemplate, LLMChain

llm = LlamaCpp(model_path="Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_0.bin", seed=0, n_ctx=2048, max_tokens=512, temperature=0.1, streaming=True)

template = """
### Instruction: 
{message}
### Response:
"""

@cl.langchain_factory
def factory():
    prompt = PromptTemplate(template=template, input_variables=["message"])
    llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

    return llm_chain

I have been using the same model with llama.cpp and llama-cpp-python without problems.

willydouhard commented 1 year ago

Thank you, I am going to prioritize this!

willydouhard commented 1 year ago

Here is the proposal to move chainlit to async by default https://github.com/Chainlit/chainlit/pull/40. Feedback wanted!

willydouhard commented 1 year ago

This should be fixed in the latest version, 0.3.0. Please note that it contains breaking changes. We prepared a migration guide to make it easy for everyone.
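
Roughly, the new async style looks like this (a sketch only: handler signatures and message APIs may differ slightly between Chainlit versions, and the legacy pre-1.0 openai SDK is assumed):

import chainlit as cl
import openai


@cl.on_message
async def main(message):
    # the handler's argument type changed across versions (plain str vs. a Message object)
    content = message if isinstance(message, str) else message.content

    msg = cl.Message(content="")
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "user", "content": content}],
        stream=True,
    )
    async for chunk in response:
        token = chunk["choices"][0]["delta"].get("content", "")
        await msg.stream_token(token)  # the token appears in the UI immediately
    await msg.send()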

xleven commented 1 year ago

Hi @willydouhard, thanks for your clarification on intermediary streaming. I agree with @segevtomer that streaming the final answer to the UI would be valuable for both long generations and simple chains. Not sure if this is the right place to ask, but since the related issues have all been closed: is final answer streaming still on the roadmap, or should a feature request be made?

willydouhard commented 1 year ago

It is still on the roadmap but I was waiting for LangChain to come up with a solution for it. This looks promising!

xleven commented 1 year ago

Just came across a new callback handler for streaming only the final output. Not sure how related it is, but hope it helps.
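
If it is the one I think it is, it works along these lines (a sketch assuming LangChain's FinalStreamingStdOutCallbackHandler, which only emits the tokens that follow the agent's final-answer prefix; exact imports may differ by version):

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.callbacks.streaming_stdout_final_only import (
    FinalStreamingStdOutCallbackHandler,
)
from langchain.llms import OpenAI

llm = OpenAI(
    streaming=True,
    callbacks=[FinalStreamingStdOutCallbackHandler()],  # only final-answer tokens are streamed out
    temperature=0,
)
tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
agent.run("What is 2 to the 10th power?")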

Serge9744 commented 9 months ago

Hi, does anyone have an answer? I am stuck and I posted the issue here: https://github.com/langchain-ai/langchain/issues/10316. Can someone help me?