Open · lzy248 opened 5 months ago
Does that mean you want to use a single model to serve multiple user requests? vLLM supports this on Linux. I'm not sure whether llama-cpp-python supports this yet.
Thanks for your response. I found that llama.cpp has a `--threads-http` parameter, so I pulled the llama.cpp repository and set it up. The requests did indeed get faster, though I'm not sure whether that's because llama.cpp itself is fast or because of this parameter.
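For reference, here's a minimal sketch of how the difference can be measured from the client side, assuming the server is running locally on the default port 8080 and exposes its OpenAI-compatible `/v1/chat/completions` endpoint (the URL, prompt, and token counts below are just placeholders):

```python
# Minimal sketch: fire several requests at a llama.cpp server concurrently
# and compare wall-clock time with and without --threads-http / --parallel.
# Assumes the server is reachable at http://localhost:8080 (default port)
# and exposes the OpenAI-compatible /v1/chat/completions endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/chat/completions"  # adjust host/port as needed

def one_request(i: int) -> float:
    start = time.time()
    resp = requests.post(
        URL,
        json={
            "messages": [{"role": "user", "content": f"Say hello #{i}"}],
            "max_tokens": 32,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=4) as pool:
        latencies = list(pool.map(one_request, range(4)))
    print(f"per-request latencies: {[round(x, 2) for x in latencies]}")
    print(f"total wall-clock time: {time.time() - t0:.2f}s")
```

If the requests are truly served in parallel, the total wall-clock time should be close to a single request's latency rather than the sum of all of them.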
You should also explore the `--parallel 2` parameter in llama.cpp. It might help :)) See the sketch below for one way to launch the server with both flags.
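Something like this, for example (a rough sketch only; the binary name, model path, and sizes are placeholders, so adjust them to your build). As far as I understand, `--parallel N` splits the context given by `-c` across the N slots, so each concurrent request only gets a share of it:

```python
# Rough sketch: launch the llama.cpp HTTP server with both flags discussed above.
# The binary name, model path, and sizes below are placeholders for illustration.
import subprocess

cmd = [
    "./llama-server",                # or ./server in older builds; adjust to your binary
    "-m", "models/your-model.gguf",  # placeholder model path
    "-c", "4096",                    # total context; with --parallel 2 each slot gets ~2048
    "--parallel", "2",               # number of slots serving requests concurrently
    "--threads-http", "4",           # threads used by the HTTP layer
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```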
got it, thank you
It means a single model handles multiple requests simultaneously.