liuzhipengchd opened this issue 3 months ago
@abetlen Can you help me with an answer?
You can't run concurrent completions on the same llama.cpp context, you have to queue them.
Thank you for the answer. If I want to stream results back, how should I do that with a queue? Could you please give me some advice?
The simplest way is to have a thread-safe lock to ensure only one request is processing completions at a time, the same as llama_cpp.server does, I believe.
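A minimal sketch of that lock approach, assuming a single shared Llama instance: the model path, max_tokens, and the stream_completion helper are placeholders, not part of llama_cpp itself. The lock is held for the duration of the stream, so concurrent requests simply wait their turn.

```python
import threading

from llama_cpp import Llama

# One shared model/context; the path is a placeholder for your own GGUF file.
llm = Llama(model_path="./models/model.gguf")

# Global lock so only one request touches the llama.cpp context at a time.
llm_lock = threading.Lock()


def stream_completion(prompt: str):
    """Yield completion text chunks, holding the lock for the whole stream."""
    with llm_lock:
        for chunk in llm.create_completion(prompt, max_tokens=256, stream=True):
            yield chunk["choices"][0]["text"]
```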
Alternatively, you could have a worker task that processes a queue: each request enqueues its job and then waits for the worker to handle it. This gets a bit complicated if you want to stream the response; see the sketch below.
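Here is one hedged way the worker-queue variant could look, again assuming a single Llama instance. The jobs queue, the sentinel object, and stream_completion are illustrative names. Each request gets its own result queue, so the worker can push chunks to it as they are generated and the request can stream them back immediately.

```python
import queue
import threading

from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf")  # placeholder path

# Each job is (prompt, per-request result queue).
jobs: "queue.Queue[tuple[str, queue.Queue]]" = queue.Queue()
_SENTINEL = object()  # marks the end of one streamed response


def worker():
    """Single worker owns the llama.cpp context and processes jobs one at a time."""
    while True:
        prompt, out = jobs.get()
        try:
            for chunk in llm.create_completion(prompt, max_tokens=256, stream=True):
                out.put(chunk["choices"][0]["text"])
        finally:
            out.put(_SENTINEL)
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


def stream_completion(prompt: str):
    """Called from each request: enqueue the job, then yield chunks as they arrive."""
    out: queue.Queue = queue.Queue()
    jobs.put((prompt, out))
    while True:
        item = out.get()
        if item is _SENTINEL:
            break
        yield item
```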
The service crashed after two concurrent requests were made, and there were no error messages. This is a problem that has been bothering me for a long time.