Open · lzy248 opened 5 months ago
Does that mean you want to use a single model to serve multiple user requests? vLLM supports this on Linux. I'm not sure whether llama-cpp-python supports this yet.
Thanks for your response. I found that llama.cpp has a `--threads-http` parameter, so I pulled the llama.cpp repository and set it up. The requests did indeed get faster, though I'm not sure whether that's because llama.cpp itself is fast or because of this parameter.
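For reference, here's a minimal sketch of how the difference can be measured from the client side, assuming the server is running locally on the default port 8080 and exposes its OpenAI-compatible `/v1/chat/completions` endpoint (the URL, prompt, and token counts below are just placeholders):

```python
# Minimal sketch: fire several requests at a llama.cpp server concurrently
# and compare wall-clock time with and without --threads-http / --parallel.
# Assumes the server is reachable at http://localhost:8080 (default port)
# and exposes the OpenAI-compatible /v1/chat/completions endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/chat/completions"  # adjust host/port as needed

def one_request(i: int) -> float:
    start = time.time()
    resp = requests.post(
        URL,
        json={
            "messages": [{"role": "user", "content": f"Say hello #{i}"}],
            "max_tokens": 32,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=4) as pool:
        latencies = list(pool.map(one_request, range(4)))
    print(f"per-request latencies: {[round(x, 2) for x in latencies]}")
    print(f"total wall-clock time: {time.time() - t0:.2f}s")
```

If the requests are truly served in parallel, the total wall-clock time should be close to a single request's latency rather than the sum of all of them.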
You should also explore the `--parallel 2` parameter in llama.cpp. It might help :)) See the sketch below for one way to launch the server with both flags.
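Something like this, for example (a rough sketch only; the binary name, model path, and sizes are placeholders, so adjust them to your build). As far as I understand, `--parallel N` splits the context given by `-c` across the N slots, so each concurrent request only gets a share of it:

```python
# Rough sketch: launch the llama.cpp HTTP server with both flags discussed above.
# The binary name, model path, and sizes below are placeholders for illustration.
import subprocess

cmd = [
    "./llama-server",                # or ./server in older builds; adjust to your binary
    "-m", "models/your-model.gguf",  # placeholder model path
    "-c", "4096",                    # total context; with --parallel 2 each slot gets ~2048
    "--parallel", "2",               # number of slots serving requests concurrently
    "--threads-http", "4",           # threads used by the HTTP layer
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```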
got it, thank you
It means a single model handles multiple requests simultaneously.