abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

streaming not working #1604

Open An9709 opened 3 months ago

An9709 commented 3 months ago

Can somebody explain why streaming is not working? The model returns the whole answer at once instead of streaming it.

llm = Llama(
    streaming=True,
    model_path="/mistral-7b-chat-int8-v0.2.Q4_K_M.gguf",
    n_ctx=32768,
    n_threads=4,
    n_gpu_layers=20,
    callbacks=[StreamingStdOutCallbackHandler()],
    verbose=True,
)

generation_kwargs = {
    "max_tokens": 20000,
    "stop": [""],
    "streaming": True,
    "echo": False,
    "top_k": 1,
}

llm_response = llm(prompt, **generation_kwargs)

final_result = llm_response["choices"][0]["text"]

return final_result

vansatchen commented 3 months ago

Maybe this should be stream=True, or "stream": True in the kwargs?
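Applied to the snippet in the original post, that would look roughly like this (just a sketch, not tested against your setup; note that with "stream": True the call returns a generator, so the text has to be collected by iterating instead of indexing the result once):

generation_kwargs = {
    "max_tokens": 20000,
    "stop": [""],
    "stream": True,  # llama-cpp-python uses "stream", not "streaming"
    "echo": False,
    "top_k": 1,
}

# The call now yields chunks; collect their text pieces as they arrive.
final_result = ""
for chunk in llm(prompt, **generation_kwargs):
    final_result += chunk["choices"][0]["text"]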

yamikumo-DSD commented 3 months ago

There seems to be some confusion between llama-cpp-python's and LangChain's interfaces (streaming= and callbacks= are LangChain arguments). When you pass stream=True (not streaming=True), Llama.__call__ returns a generator object, which cannot be treated like a normal response object; you have to iterate over it.

Try this:

from llama_cpp import Llama

llm = Llama(
    model_path="path/to/your/model.gguf",
    n_ctx=100,
    n_gpu_layers=-1,
    verbose=False,
)

# stream=True turns the call into a generator of completion chunks.
stream = llm(
    "hi, ",
    max_tokens=10,
    stream=True,
)

# Each chunk carries the newly generated text in choices[0]["text"].
for output in stream:
    token = output["choices"][0]["text"]
    print(token, end="", flush=True)
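If you want the chat-completion interface instead (the model in the original post looks like a chat model), the same idea applies to create_chat_completion. This is only a rough sketch of what I'd expect: streamed chat chunks expose the incremental text under a "delta" key rather than "text", and the first chunk may carry only the assistant "role".

from llama_cpp import Llama

llm = Llama(
    model_path="path/to/your/model.gguf",
    n_gpu_layers=-1,
    verbose=False,
)

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=32,
    stream=True,
)

for chunk in stream:
    # Incremental text arrives under "delta"; skip chunks without "content".
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()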