Thank you for your well-written and reproducible issue.
Except for the llm instantiation (more on that later), the code looks OK and should work. However, when running it I noticed a bug in the LangChain LlamaCpp code.
So what I did is switch back to the sync implementation and wrap it in cl.make_async. I also moved the llm instantiation out of cl.on_chat_start (it would run once per user, which does not seem necessary, especially for local LLMs).
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
import chainlit as cl
@cl.cache
def instantiate_llm():
    n_batch = (
        4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    )
    # Make sure the model path is correct for your system!
    llm = LlamaCpp(
        model_path="/Users/willydouhard/Downloads/yarn-llama-2-7b-128k.Q3_K_M.gguf",
        n_batch=n_batch,
        n_ctx=4096,
        temperature=1,
        max_tokens=10000,
        n_threads=64,
        verbose=True,  # Verbose is required to pass to the callback manager
        streaming=True,
    )
    return llm
llm = instantiate_llm()
@cl.on_chat_start
def main():
    template = """### System Prompt
The following is a friendly conversation between a human and an AI optimized to generate source-code. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know:
### Current conversation:
{history}
### User Message
{input}
### Assistant"""
    prompt = PromptTemplate(template=template, input_variables=["history", "input"])
    conversation = ConversationChain(
        prompt=prompt, llm=llm, memory=ConversationBufferWindowMemory(k=10)
    )
    cl.user_session.set("conv_chain", conversation)
@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")
    cb = cl.LangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Assistant"]
    )
    res = await cl.make_async(conversation)(message, callbacks=[cb])
    # Do any post processing here
    await cl.Message(content=res["response"]).send()
Then I was able to see the tokens being streamed to the Chainlit UI. I used the 7B variant of the model you are using, but it should work the same.
For the final answer streaming, it only works if the last step of the chain always starts with the same prefix (like Final Answer). However, if you know your chain only has one step, you can force final answer streaming by manually setting answer_reached to True after instantiating the callback handler and before calling the chain:
cb.answer_reached = True
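In the example above that would look roughly like this (same on_message handler as before, with the flag set right after creating the handler and before invoking the chain):

@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")
    cb = cl.LangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Assistant"]
    )
    # Force final answer streaming since this chain only has one step
    cb.answer_reached = True
    res = await cl.make_async(conversation)(message, callbacks=[cb])
    await cl.Message(content=res["response"]).send()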
Many thanks for your fast response, the intermediate streaming worked like a charm! To get the final answer streaming as well, I updated the callback according to your suggestion:
cb = cl.LangchainCallbackHandler(
    stream_final_answer=True, answer_prefix_tokens=["Response"]
)
I just replaced answer_prefix_tokens=["Assistant"] with answer_prefix_tokens=["Response"].
This works because the last word of the prompt I'm using is "Assistant" and the LLM always completes the response by returning "Response" first, before actually answering the question.
That was not the case for the 7B model, but the one you use seems smarter!
Hi all, I'm unable to find any snippet related to the usage of LlamaCpp and ConversationChain integrated with Chainlit, and I'm a bit lost at this point.
Here is the code to reproduce my issue:
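It follows the usual Chainlit + LangChain pattern, roughly like this (a trimmed, representative sketch rather than the exact file; the model path, parameters, and the async acall path are placeholders for illustration):

from langchain.llms import LlamaCpp
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
import chainlit as cl


@cl.on_chat_start
def start():
    # LLM instantiated once per chat session
    llm = LlamaCpp(
        model_path="path/to/phind-codellama-34b-v2.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,
        streaming=True,
        verbose=True,
    )
    conversation = ConversationChain(
        llm=llm, memory=ConversationBufferWindowMemory(k=10)
    )
    cl.user_session.set("conv_chain", conversation)


@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")
    cb = cl.LangchainCallbackHandler(stream_final_answer=True)
    # Async call path; this is where nothing is streamed to the UI
    res = await conversation.acall(message, callbacks=[cb])
    await cl.Message(content=res["response"]).send()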
Here's my "chainlit run app.py -w" output; as you can see, it clearly states that the callback is never used (?)
Used LLM
Phind-CodeLlama-34B-v2-GGUF
Additional info
Appendix:
This code only uses LangChain and shows that I get streaming and consistently proper output.
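A representative sketch of that standalone test (assuming the stdout streaming callback from LangChain; the model path is a placeholder, not my actual file):

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

llm = LlamaCpp(
    model_path="path/to/phind-codellama-34b-v2.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    streaming=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

conversation = ConversationChain(
    llm=llm, memory=ConversationBufferWindowMemory(k=10)
)

# Tokens are printed to stdout as they are generated
print(conversation.run("Write a Python function that reverses a string."))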