Hello, I came across this project while searching for OpenAI-API-compatible servers for llama.cpp, and I was wondering whether it can handle multiple requests at once.
Loading another copy of the model into RAM for each concurrent user doesn't seem like a great idea, so I'm curious whether concurrency is even possible with this project.
Yes, this project does handle multiple requests: the model is loaded once, and an inference session is spawned for each request. :) Support for request batching is coming soon. :)
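To make the shared-model / per-request-session idea concrete, here is a minimal Python sketch. All names here (`SharedModel`, `Session`, `handle_request`, the model path) are hypothetical placeholders, not this project's actual API; the point is only that the weights sit in RAM once while each request carries its own lightweight state:

```python
import threading

# Hypothetical names for illustration only -- the real project's API differs.
# Key idea: the (large) model weights are loaded into RAM exactly once, and
# each incoming request gets its own small inference session (KV cache,
# sampling state) that borrows the shared weights.

class SharedModel:
    """Loaded once at server startup; read-only afterwards, so it is safe
    to share across request threads."""
    def __init__(self, path: str):
        self.path = path  # stand-in for the multi-GB weight tensors

class Session:
    """Per-request state: cheap to create compared to the model itself."""
    def __init__(self, model: SharedModel):
        self.model = model  # a reference, not a copy -- no extra RAM per user
        self.kv_cache = []  # each request keeps its own context

    def generate(self, prompt: str) -> str:
        return f"completion for {prompt!r}"  # placeholder for real inference

MODEL = SharedModel("/models/example.gguf")  # hypothetical path

def handle_request(prompt: str) -> str:
    session = Session(MODEL)  # spawned per request
    return session.generate(prompt)

# Two concurrent requests share a single copy of the weights:
threads = [threading.Thread(target=handle_request, args=(p,))
           for p in ("hello", "world")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Batching, when it lands, would go one step further: instead of each session running its forward passes independently, concurrent requests are grouped into a single pass over the shared model.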
Thank you for your work!