I started researching vllm since it's one of the engines in InstructLab.
In their README they state (points relevant to this issue):
There is also a relevant performance benchmark.
We can easily test their OpenAI-compatible web server in a container: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html. This will be one of my next steps.
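A minimal sketch of what that test could look like once the container is up, assuming the server listens on `localhost:8000` as in the linked docs; the model name and the `EMPTY` API key are placeholders, not confirmed values for our deployment:

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server.
# Assumes the server is reachable at http://localhost:8000 and serves the
# example model below; adjust both to the actual deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```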
I love that they have metrics baked in, so it will be easy for us to measure how many people use our service: https://docs.vllm.ai/en/v0.6.0/serving/metrics.html
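A quick sketch of how we could poll those metrics, assuming the same local server exposes a Prometheus-style `/metrics` endpoint as described in the linked docs; exact metric names may vary between vLLM versions:

```python
# Minimal sketch: dump vLLM's Prometheus metrics from a local server.
# Assumes http://localhost:8000/metrics is exposed as in the vLLM docs.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text

# Print only vLLM's own gauges/counters (e.g. running/waiting requests),
# which are the ones useful for tracking how much the service is used.
for line in metrics.splitlines():
    if line.startswith("vllm:"):
        print(line)
```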
One thing to check is the streaming response; there is an open PR here: https://github.com/vllm-project/vllm/pull/7648
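I have not verified what that PR changes exactly, but checking streaming from the client side could look like this sketch, reusing the assumed local server and example model name from above:

```python
# Minimal sketch: consume a streamed chat completion from the
# OpenAI-compatible endpoint of a local vLLM server (assumed setup as above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
    messages=[{"role": "user", "content": "Stream a short greeting."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```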
My write-up from the initial comparison of vllm and llama-cpp: https://blog.tomecek.net/post/comparing-llamacpp-vllm/
This is a tracking issue for figuring out how the service can process multiple requests in parallel, "so users wouldn't notice" queueing and we don't need to invest heavily in multiple GPUs.
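As a first measurement, a small smoke test like the sketch below could show whether concurrent requests really overlap (thanks to vLLM's continuous batching) instead of running one after another; it again assumes the local server and example model name used above:

```python
# Minimal sketch: fire N requests concurrently against the assumed local
# vLLM server and compare total wall-clock time with the slowest single
# request. If batching works, the total should be close to the slowest
# request rather than N times a single request.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # example model name
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
    )
    return time.perf_counter() - start

async def main(n: int = 8) -> None:
    start = time.perf_counter()
    durations = await asyncio.gather(*(one_request(i) for i in range(n)))
    total = time.perf_counter() - start
    print(f"{n} parallel requests finished in {total:.1f}s "
          f"(slowest single request: {max(durations):.1f}s)")

if __name__ == "__main__":
    asyncio.run(main())
```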