Description
We want to stream the tokens of the agent's final output.
This allows the user to start reading the response as soon as it begins to be generated, instead of waiting until the whole answer is done.
This should greatly improve the time until a response is visible to the user, especially for longer answers. It would also make the bot feel more “alive”, since it would feel more like someone typing in real time.
Implementation
Using on_llm_new_token does not work, because all sorts of output are then streamed, not only the final answer.
FinalStreamingStdOutCallbackHandler does not work either, because it was written for the ZERO_SHOT_REACT_DESCRIPTION agent.
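The prefix-detection idea behind FinalStreamingStdOutCallbackHandler can be reimplemented for other agent types. Below is a minimal sketch, assuming the agent marks its answer with a literal "Final Answer:" prefix; the class name FinalAnswerStreamingHandler and the default prefix are assumptions, not code from this repo.

```python
from langchain.callbacks.base import BaseCallbackHandler

class FinalAnswerStreamingHandler(BaseCallbackHandler):
    """Buffer tokens until the agent's final-answer prefix appears,
    then forward every token that follows it."""

    def __init__(self, answer_prefix: str = "Final Answer:"):
        self.answer_prefix = answer_prefix
        self.buffer = ""
        self.streaming = False

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        if self.streaming:
            print(token, end="", flush=True)  # swap print for any sink
            return
        self.buffer += token
        if self.answer_prefix in self.buffer:
            # emit whatever already follows the prefix, then stream live
            self.streaming = True
            after = self.buffer.split(self.answer_prefix, 1)[1]
            print(after, end="", flush=True)
```

The handler would be passed to the agent via callbacks=[...] together with an LLM created with streaming=True, since without streaming the per-token callback never fires.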
Todo's
[x] Wait for a possible answer on the LangChain Discord support forum
[x] Wait for harisrab's implementation
The implementation works, but the final answer is only streamed after it has been fully generated by the LLM. This looks nice, but does not reduce how long the user has to wait for an answer (see the WebSocket sketch after this list).
[x] Wait again for a response from harisrab
[x] Implement backend
[x] Implement frontend
[x] Final testing
[x] Documentation of the new stream over WebSocket in the README
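For the backend, tokens can be bridged from the LangChain callback to the WebSocket through an asyncio.Queue. Below is a minimal sketch, assuming a FastAPI backend; run_agent is a hypothetical helper that invokes the agent with callbacks=[handler] and a streaming-enabled LLM.

```python
import asyncio

from fastapi import FastAPI, WebSocket
from langchain.callbacks.base import AsyncCallbackHandler

app = FastAPI()

class QueueCallbackHandler(AsyncCallbackHandler):
    """Put every new token on a queue; None marks the end of generation."""

    def __init__(self, queue: asyncio.Queue):
        self.queue = queue

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        await self.queue.put(token)

    async def on_llm_end(self, response, **kwargs) -> None:
        # Caveat: in a multi-step agent this fires after every LLM call;
        # in practice, combine it with final-answer detection.
        await self.queue.put(None)  # sentinel: generation finished

@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    question = await websocket.receive_text()
    queue: asyncio.Queue = asyncio.Queue()
    handler = QueueCallbackHandler(queue)
    # run_agent is hypothetical: it should call the agent with
    # callbacks=[handler] and an LLM created with streaming=True
    task = asyncio.create_task(run_agent(question, callbacks=[handler]))
    while (token := await queue.get()) is not None:
        await websocket.send_text(token)  # token reaches the client immediately
    await task
```

On the frontend, a plain WebSocket client can append each received message to the chat bubble as it arrives.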
Resources