Closed: hooman-bayer closed this issue 1 year ago
You are correct, we are not leveraging async implementations at the moment. The main reason is that I feel most code bases are not written in the async paradigm, and it is quite hard, and not always possible, to transition from sync to async.
To mitigate this, we currently run agents in different threads so that a single agent does not block the whole app.
As we move forward I would love to see Chainlit support async implementations :)
For streaming, Chainlit already supports streaming with openai, langchain and any python code. See https://docs.chainlit.io/concepts/streaming/python :)
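For reference, a minimal sketch of what token-by-token streaming to the UI can look like with a recent (async) Chainlit API; the `@cl.on_message` handler and `stream_token` method reflect the current docs and may differ from the version discussed in this thread, and the token list is a placeholder for real LLM output:

```python
import chainlit as cl

@cl.on_message
async def on_message(message: cl.Message):
    # Create an empty message and push tokens into it as they arrive.
    msg = cl.Message(content="")
    for token in ["streamed ", "token ", "by ", "token"]:  # placeholder tokens
        await msg.stream_token(token)
    await msg.send()
```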
@willydouhard happy to contribute at some point in the near future, in case it becomes part of your roadmap. I have been building an app that fully extends langchain to async, including tools (using their class signature that offers arun). But you are 100% right that most libraries only offer sync APIs.
> For streaming, Chainlit already supports streaming with openai, langchain and any python code. See https://docs.chainlit.io/concepts/streaming/python :)
Correct, I saw it, but it's again kind of faking it :) The client still needs to wait until the response is completed from the OpenAI endpoint, which might not be desired. For instance, openai.ChatCompletion.acreate creates an SSE stream and passes the response to the client token by token as it is generated by ChatGPT, so the latency is much smaller.
Also, in the case of your WebSocket for action agents, this could bring a much better experience for the user.
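For illustration, a minimal sketch of async token-by-token streaming with the pre-1.0 openai Python SDK this thread refers to; the model name and the print call are placeholders for whatever the app actually forwards to the client:

```python
import asyncio
import openai

async def stream_chat(prompt: str) -> str:
    # stream=True makes acreate return an async generator of chunks, so tokens
    # can be forwarded to the client as soon as they are produced.
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    full_response = ""
    async for chunk in response:
        token = chunk["choices"][0]["delta"].get("content", "")
        full_response += token
        print(token, end="", flush=True)  # forward over the WebSocket/SSE channel here
    return full_response

asyncio.run(stream_chat("Hello!"))
```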
Interesting, in my understanding openai.ChatCompletion.create was not waiting for the whole response to be generated before starting to stream tokens. Do you happen to have a link to a resource covering that in more detail?
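For comparison, a minimal sketch of the synchronous streaming form with the same pre-1.0 SDK: create(stream=True) also yields chunks as they are generated, but it blocks the calling thread between chunks instead of yielding control to an event loop (the model and prompt are placeholders):

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # Each chunk carries at most one token in its delta.
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```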
To add to the conversation, I tried a few different chain classes and couldn't get streaming to work on any of them (the message was only updated on screen once the response was complete).
For LangChain, only the intermediary steps are streamed at the moment. If you configured your LLM with streaming=True, you should see the intermediary steps being streamed if you unfold them in the UI (click on the Using... button).
I will take a look at how to also stream the final response!
@willydouhard see this issue from the openai python sdk for more details. In general, if you want to keep the tool as a simple POC for a small number of users, I think it is great as-is with sync. But what if we want to scale to 100 users or so? I think running all of this on different threads is not so realistic or modern (the user experience with sync also won't be great); async is probably the way to go.
Thank you for the link @hooman-bayer, I pretty much agree with you, and we also want to see where the community wants Chainlit to go: staying a rapid prototyping tool, or deploying to production and scaling.
As for streaming final responses in LangChain @segevtomer I found this interesting issue https://github.com/hwchase17/langchain/issues/2483. I'll dig more into it!
@willydouhard that is using the AsyncCallbackHandler I mentioned above. Using that, you get access to on_llm_new_token(self, token: str, **kwargs: Any) -> None:, which you could then customize to return the output token by token to the client.
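For example, a minimal sketch (assuming the LangChain callback API of that era) of such a handler; the class name is made up and the print call stands in for a send over the app's WebSocket or SSE channel:

```python
from typing import Any
from langchain.callbacks.base import AsyncCallbackHandler

class TokenForwardingHandler(AsyncCallbackHandler):
    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # Called for every new token when the LLM is created with streaming=True.
        print(token, end="", flush=True)  # replace with a send to the client
```

The handler would then be passed to an LLM or chain via callbacks=[TokenForwardingHandler()] together with streaming=True.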
@segevtomer For clarity, all the intermediary steps are already streamed, including the last one, which is the final response. Then the final response is sent as a standalone message (not an intermediary step) in the UI without any overhead so the user can see it.
What I am saying here is that the only improvement we can do is to stream the last tokens after your stop token (usually Final answer:) directly without waiting for the completion to end. This is what https://github.com/hwchase17/langchain/issues/2483 does.
While this would be a win for the user experience, the actual time gain will be very limited, since it only impacts a few tokens at the very end of the whole process.
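To make the idea concrete, here is a minimal sketch (not the exact code from the linked issue) of streaming only the tokens that follow the answer prefix: buffer everything until the prefix appears, then forward the rest token by token. The class name and the print call are placeholders:

```python
from typing import Any
from langchain.callbacks.base import AsyncCallbackHandler

class FinalAnswerStreamingHandler(AsyncCallbackHandler):
    def __init__(self, answer_prefix: str = "Final Answer:"):
        self.answer_prefix = answer_prefix
        self.buffer = ""
        self.streaming_final = False

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        if self.streaming_final:
            # Past the prefix: forward tokens immediately (e.g. to the UI).
            print(token, end="", flush=True)
            return
        self.buffer += token
        if self.answer_prefix in self.buffer:
            # Everything after the prefix belongs to the final answer.
            self.streaming_final = True
            remainder = self.buffer.split(self.answer_prefix, 1)[1]
            print(remainder, end="", flush=True)
```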
Thanks for the update @willydouhard. I wouldn't say it's "very limited". I agree that there will still be a delay because we have to wait for the final prompt in the chain to occur, but it is still very valuable to stream it. Let's say the final response is over 1k tokens long; streaming that will still be significant for UX.
I have two problems with this:
So what I would need is either streaming of the final result, or a configurable timeout before the UI loses connection to the server, and some spinner to indicate that something is happening. Preferably both.
@Banbury what is your setup? Are you running open source models like gpt4all locally, or are you using the openai api?
I have been trying to run Vicuna locally with langchain. It does work more or less, but only for short texts.
So I have seen issues with local models and we are investigating them. For API models everything should work fine. It would be helpful if you could share a code snippet so I can try to reproduce it.
This is the code I have been working on. It's simple enough.
```python
import chainlit as cl
from llama_cpp import Llama
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain import PromptTemplate, LLMChain

llm = LlamaCpp(model_path="Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_0.bin", seed=0, n_ctx=2048, max_tokens=512, temperature=0.1, streaming=True)

template = """
### Instruction:
{message}
### Response:
"""

@cl.langchain_factory
def factory():
    prompt = PromptTemplate(template=template, input_variables=["message"])
    llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
    return llm_chain
```
I have been using the same model with llama.cpp and llama-cpp-python without problems.
Thank you, I am going to prioritize this!
Here is the proposal to move chainlit to async by default https://github.com/Chainlit/chainlit/pull/40. Feedback wanted!
Should be fixed in the latest version 0.3.0. Please note that it contains breaking changes. We prepared a migration guide to make it easy for everyone.
Hi @willydouhard, thanks for your clarification on intermediary streaming. I agree with @segevtomer that streaming the final answer to the UI would be valuable for both long generations and simple chains. Not sure if this is the right place to ask, but since related issues have all been closed, is final answer streaming still on the roadmap, or should a feature request be made?
It is still on the roadmap but I was waiting for LangChain to come up with a solution for it. This looks promising!
Just came across a new callback handler for streaming the final answer iterator. Not sure how related it is, but I hope it helps.
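If the handler in question is LangChain's FinalStreamingStdOutCallbackHandler (an assumption on my part), a minimal usage sketch looks like this: only tokens after the agent's final-answer prefix are streamed to stdout:

```python
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler
from langchain.llms import OpenAI

# streaming=True is required so the callback receives tokens as they are generated;
# the handler suppresses everything before the final-answer prefix.
llm = OpenAI(
    streaming=True,
    callbacks=[FinalStreamingStdOutCallbackHandler()],
    temperature=0,
)
```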
Hi, does anyone have an answer? I am stuck and I posted the issue here: https://github.com/langchain-ai/langchain/issues/10316. Can someone help me?
This is amazing work! Props to you! A lot of the ideas are really forward-looking, such as asking the user for an input action!
I was looking into the examples and it seems like the current implementation is not really using asynchronous endpoints. For instance:
- openai.ChatCompletion.acreate, which is an async generator
- AsyncCallbackHandler

This is especially helpful for agents that can take a long time to run and might clog the backend.
Cheers