An9709 opened this issue 3 months ago
Maybe this needs to be stream=True or "stream": True?
There seems to be some confusion between llama-cpp-python's and LangChain's interfaces. When you pass stream=True (not streaming=True), Llama.__call__ returns a generator object, which cannot be treated like a normal response object. Try this:
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/your/model.gguf",
    n_ctx=100,
    n_gpu_layers=-1,
    verbose=False,
)

# stream=True makes the call return a generator of partial completions
stream = llm(
    "hi, ",
    max_tokens=10,
    stream=True,
)

# iterate over the generator and print each token as it arrives
for output in stream:
    token = output["choices"][0]["text"]
    print(token)
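If the goal is instead to stream through LangChain rather than calling llama_cpp.Llama directly, a rough sketch using LangChain's LlamaCpp wrapper might look like the following (assuming the langchain_community package is installed; the model path and prompt are placeholders):

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="path/to/your/model.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)

# .stream() yields text chunks as the model generates them
for chunk in llm.stream("hi, "):
    print(chunk, end="", flush=True)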
Can somebody explain why streaming is not working? The model returns the whole answer at once instead of streaming it.
llm = Llama(
    streaming=True,
    model_path="/mistral-7b-chat-int8-v0.2.Q4_K_M.gguf",
    n_ctx=32768,
    n_threads=4,
    n_gpu_layers=20,
    callbacks=[StreamingStdOutCallbackHandler()],
    verbose=True,
)

generation_kwargs = {
    "max_tokens": 20000,
    "stop": [""],
    "streaming": True,
    "echo": False,
    "top_k": 1,
}

llm_response = llm(prompt, **generation_kwargs)
final_result = llm_response["choices"][0]["text"]
return final_result
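For reference, a minimal sketch of how that snippet could be adjusted so that llama_cpp.Llama actually streams: the LangChain-only arguments (streaming, callbacks) are dropped because Llama ignores them, stream=True is passed to the call itself, and the resulting generator is iterated token by token. The model path and remaining settings are carried over from the question; prompt is a placeholder, and the empty-string stop sequence is omitted.

from llama_cpp import Llama

llm = Llama(
    model_path="/mistral-7b-chat-int8-v0.2.Q4_K_M.gguf",
    n_ctx=32768,
    n_threads=4,
    n_gpu_layers=20,
    verbose=True,
)

prompt = "hi, "  # placeholder prompt

generation_kwargs = {
    "max_tokens": 20000,
    "echo": False,
    "top_k": 1,
    "stream": True,  # the keyword llama-cpp-python actually reads
}

# the call now returns a generator; collect tokens as they arrive
final_result = ""
for chunk in llm(prompt, **generation_kwargs):
    token = chunk["choices"][0]["text"]
    print(token, end="", flush=True)
    final_result += token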