Chainlit / chainlit

Build Conversational AI in minutes ⚡️
https://docs.chainlit.io
Apache License 2.0

NotImplementedError: Async generation not implemented for this LLM. #74

Closed · prasad4fun closed 1 year ago

prasad4fun commented 1 year ago

Issue Description

Problem: When attempting to use the RetrievalQA chain with a custom fine-tuned Llama model and streaming enabled, the following error occurs:

NotImplementedError: Async generation not implemented for this LLM.

Steps to Reproduce

  1. Enable streaming using the provided code:

    # `model` and `tokenizer` are the already loaded fine-tuned Llama model and its tokenizer
    from transformers import TextStreamer, pipeline

    streamer = TextStreamer(tokenizer)
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=2048,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.15,
        streamer=streamer,
    )
  2. Instantiate a RetrievalQA chain using the custom Llama model with the from_chain_type method, specifying the necessary parameters (how llm_model is built from the pipeline is sketched after this list).

    from langchain.chains import RetrievalQA

    # llm_model is the pipeline above wrapped as a LangChain LLM (see the sketch below)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm_model,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        verbose=True,
    )
  3. Attempt to stream in chainlit using the following code:

    import chainlit as cl

    @cl.langchain_factory(use_async=True)
    def main():
        return qa_chain
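
The construction of llm_model and retriever is not shown above. A minimal sketch of the missing wrapping step, assuming the transformers pipeline from step 1 is wrapped with LangChain's HuggingFacePipeline (the retriever setup is still omitted):

    from langchain.llms import HuggingFacePipeline

    # wrap the streaming text-generation pipeline from step 1 as a LangChain LLM
    llm_model = HuggingFacePipeline(pipeline=pipe)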

Expected Behavior

I expected the streaming functionality to work with the custom language model rather than encounter the NotImplementedError.

Additional Information

Please suggest the appropriate way to achieve streaming with custom language models.

willydouhard commented 1 year ago

Hello, it means the agent / tools you are using do not have an async implementation. You can fall back to the sync implementation by simply setting use_async=False. You will still be able to stream!
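
Applied to the snippet from step 3, that change would look like this:

    @cl.langchain_factory(use_async=False)
    def main():
        return qa_chain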

prasad4fun commented 1 year ago

> Hello, it means the agent / tools you are using do not have an async implementation. You can fall back to the sync implementation by simply setting use_async=False. You will still be able to stream!

I followed your suggestion and set use_async=False. Now, I can see tokens being printed one at a time in the terminal when using the verbose=True option. However, in chainlit, the tokens are not being streamed continuously. Instead, the UI displays the 'RetrievalQA loader' until the entire answer generation is completed, and then it renders the final answer all at once.

willydouhard commented 1 year ago

In langchain only the intermediary steps are streamed (if you unfold the RetrievalQA loader you should see the text being streamed). We are currently looking into ways to stream the final answer properly.

prasad4fun commented 1 year ago

> In langchain only the intermediary steps are streamed (if you unfold the RetrievalQA loader you should see the text being streamed). We are currently looking into ways to stream the final answer properly.

Even after unfolding the RetrievalQA loader, the text isn't being streamed. Only the final response is rendered.

willydouhard commented 1 year ago

It seems this is the same issue as #84.