abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

how to make multiple inference requests from a single model object #1529

Open liuzhipengchd opened 3 months ago

liuzhipengchd commented 3 months ago
# Assumed context for the snippet below: `model` (the llama_cpp.Llama instance),
# `router` (the FastAPI APIRouter), `lock` (an asyncio.Lock) and the LLMRequest
# pydantic model are defined elsewhere in the application.
from fastapi.responses import StreamingResponse

def generate_response_stream(_model, _messages, _max_tokens=8192):
    _stream = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=_max_tokens,
        stream=True
    )
    for chunk in _stream:
        delta = chunk["choices"][0]["delta"]
        if "role" in delta:
            yield delta["role"] + ":"
        elif "content" in delta:
            yield delta["content"]

@router.post("/llm_generator")
async def llm_post(guideIn: LLMRequest):
    jsonString = {**guideIn.dict()}
    messages = jsonString["messages"]
    async with lock:
        return StreamingResponse(
            generate_response_stream(model, messages),
            media_type="text/plain"
        )

The service crashed after two concurrent requests were made, and there were no error messages.

This problem has been bothering me for a long time.

liuzhipengchd commented 3 months ago

@abetlen Can you help me with an answer?

CISC commented 3 months ago

You can't run concurrent completions on the same llama.cpp context, you have to queue them.

liuzhipengchd commented 3 months ago

You can't run concurrent completions on the same llama.cpp context, you have to queue them.

Thank you for the answer. If I want to stream the results back, how should I do that with a queue? Could you please give me some advice?

CISC commented 3 months ago

The simplest way is to have a thread-safe lock to ensure only one request is processing completions at a time, the same as llama_cpp.server does I believe.
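For reference, a minimal sketch of the lock approach, assuming the same FastAPI setup as in the original snippet (`model`, `router` and `LLMRequest` reused from there); the important detail is that the lock has to stay held for the entire time the stream is consumed, not just while the StreamingResponse object is created, so it is acquired inside the generator itself. The blocking llama.cpp calls are pushed to a thread so the event loop stays responsive. This is only an illustration, not the exact code llama_cpp.server uses.

import asyncio
from functools import partial

from fastapi.responses import StreamingResponse

# Module-level lock shared by all requests (assumes Python 3.10+ where the
# lock is not bound to a loop at creation time).
completion_lock = asyncio.Lock()

async def locked_stream(messages, max_tokens=8192):
    # Hold the lock until the generator is fully consumed, so only one
    # completion ever touches the llama.cpp context at a time.
    async with completion_lock:
        loop = asyncio.get_running_loop()
        stream = await loop.run_in_executor(
            None,
            partial(
                model.create_chat_completion,
                messages,
                stop=["<|eot_id|>", "<|end_of_text|>"],
                max_tokens=max_tokens,
                stream=True,
            ),
        )
        sentinel = object()
        while True:
            # next() on the llama.cpp stream blocks, so run it in a thread too.
            chunk = await loop.run_in_executor(None, next, stream, sentinel)
            if chunk is sentinel:
                break
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

@router.post("/llm_generator")
async def llm_post(guideIn: LLMRequest):
    messages = guideIn.dict()["messages"]
    # The lock is acquired inside the generator, so concurrent requests
    # queue up instead of running completions on the same context.
    return StreamingResponse(locked_stream(messages), media_type="text/plain")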

Alternatively, you could have a worker task that processes a queue populated by the incoming requests; each request then waits for the worker to handle it. This gets a bit complicated if you want to stream the response.
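A rough sketch of the worker/queue variant under the same assumptions as above; the `Job`, `job_queue` and `completion_worker` names are made up for illustration. Each request enqueues a job carrying its own per-request queue; the single worker runs completions one at a time and pushes chunks into that queue, which the endpoint streams from.

import asyncio
from dataclasses import dataclass, field

from fastapi.responses import StreamingResponse

@dataclass
class Job:
    messages: list
    chunks: asyncio.Queue = field(default_factory=asyncio.Queue)

job_queue: asyncio.Queue = asyncio.Queue()
_DONE = object()  # sentinel marking the end of a stream

async def completion_worker():
    # Single consumer: processes one completion at a time, in arrival order.
    loop = asyncio.get_running_loop()
    while True:
        job = await job_queue.get()
        stream = model.create_chat_completion(
            job.messages,
            stop=["<|eot_id|>", "<|end_of_text|>"],
            stream=True,
        )
        sentinel = object()
        while True:
            # Pull chunks in a thread so the blocking call doesn't stall the loop.
            chunk = await loop.run_in_executor(None, next, stream, sentinel)
            if chunk is sentinel:
                break
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                await job.chunks.put(delta["content"])
        await job.chunks.put(_DONE)
        job_queue.task_done()

async def stream_job(job: Job):
    # Relay chunks from the worker to the HTTP response as they arrive.
    while True:
        item = await job.chunks.get()
        if item is _DONE:
            break
        yield item

@router.post("/llm_generator")
async def llm_post(guideIn: LLMRequest):
    job = Job(messages=guideIn.dict()["messages"])
    await job_queue.put(job)
    return StreamingResponse(stream_job(job), media_type="text/plain")

# The worker would be started once at application startup, e.g. with
# asyncio.create_task(completion_worker()) in a FastAPI startup handler.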